The DNA sequence is composed of a series of nucleotides abbreviated as 'A', 'C', 'G', and 'T'.
"ACGAATTCCG" is a DNA sequence.When studying DNA, it is useful to identify repeated sequences within the DNA.
Given a string s that represents a DNA sequence, return all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule. You may return the answer in any order.
Example 1:
Input: s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT" Output: ["AAAAACCCCC","CCCCCAAAAA"]
Example 2:
Input: s = "AAAAAAAAAAAAA" Output: ["AAAAAAAAAA"]
Constraints:
1 <= s.length <= 105s[i] is either 'A', 'C', 'G', or 'T'.Problem Overview: Given a DNA string s containing characters A, C, G, and T, return all 10-letter-long sequences that appear more than once. The challenge is detecting repeated substrings efficiently without checking every possible pair.
Approach 1: Sliding Window with Hash Map (O(n) time, O(n) space)
Iterate through the string using a fixed-size sliding window of length 10. For every position, extract the substring s[i:i+10] and track its frequency using a hash map or set. The key idea is that each 10-character sequence can be processed once while moving the window forward by one character. When a sequence appears for the second time, add it to the result list. Hash lookups and insertions run in O(1) on average, making the full scan linear.
This approach is straightforward and easy to implement. It works well because the substring size is constant, so extracting and hashing each window remains efficient. The tradeoff is memory usage: storing many substrings can take up to O(n) space. Still, for most interview constraints this method is perfectly acceptable. This technique relies heavily on a hash table combined with a sliding window traversal of the string.
Approach 2: Bit Manipulation (Rolling Hash) (O(n) time, O(n) space)
DNA characters come from only four possibilities, so each can be encoded using 2 bits: A=00, C=01, G=10, T=11. A 10-character sequence therefore fits inside a 20-bit integer. Instead of storing substrings, convert each window into its integer representation and update it using a rolling hash. Shift the current value left by 2 bits, add the new character encoding, and mask the lowest 20 bits to maintain the 10-character window.
Store these encoded values in a set to track sequences you've already seen. When the same 20-bit pattern appears again, record the corresponding substring as a repeat. This avoids repeatedly allocating substring objects and keeps comparisons extremely fast. The algorithm still runs in O(n) time, but it reduces constant factors and memory overhead compared with storing raw strings.
This method demonstrates deeper understanding of bit manipulation and rolling hash techniques. It's especially useful when substring creation is expensive or when memory optimization matters.
Recommended for interviews: Start with the sliding window + hash map explanation because it clearly shows the repeated-substring detection strategy. Then mention the bit-encoding optimization to demonstrate deeper algorithmic thinking. Interviewers typically expect the O(n) sliding window approach, but the rolling hash version shows stronger control over performance and memory.
This approach utilizes a sliding window technique combined with a hash map (or dictionary) to identify 10-letter-long sequences that appear more than once. You slide the window across the string and keep track of all sequences in a hash map. If a sequence is found more than once, it's added to the result list.
This Python solution uses a set to track seen sequences and another set to track repeated sequences. It iterates over the string and extracts each 10-letter sequence using slicing. If a sequence has been seen before, it's added to the 'repeated' set.
JavaScript
Time Complexity: O(n), where n is the length of the string, due to the single pass over the string.
Space Complexity: O(n), as we store sequences in sets.
This approach utilizes bit manipulation to represent sequences as numbers, which can help optimize space usage when compared to storing strings directly. Each nucleotide is encoded as 2 bits ('A' = 00, 'C' = 01, 'G' = 10, 'T' = 11), allowing the entire 10-letter sequence to fit within 20 bits.
This C solution uses bit manipulation to encode 10-letter sequences as 20-bit integers. It uses arrays to track sequences seen and repeated. This allows for efficient membership checking and space usage.
C++
Time Complexity: O(n), as each character in the string is processed once.
Space Complexity: O(4^10), due to fixed encoding size for sequences.
| Approach | Complexity |
|---|---|
| Sliding Window with Hash Map | Time Complexity: O(n), where n is the length of the string, due to the single pass over the string. |
| Bit Manipulation | Time Complexity: O(n), as each character in the string is processed once. |
| Approach | Time | Space | When to Use |
|---|---|---|---|
| Sliding Window with Hash Map | O(n) | O(n) | General solution. Easy to implement and explain in interviews. |
| Bit Manipulation (Rolling Hash) | O(n) | O(n) | When optimizing memory and avoiding substring allocations. |
Repeated DNA Sequences - Leetcode 187 - Python • NeetCode • 37,535 views views
Watch 9 more video solutions →Practice Repeated DNA Sequences with our built-in code editor and test cases.
Practice on FleetCode