The DNA sequence is composed of a series of nucleotides abbreviated as 'A', 'C', 'G', and 'T'.
"ACGAATTCCG" is a DNA sequence.When studying DNA, it is useful to identify repeated sequences within the DNA.
Given a string s that represents a DNA sequence, return all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule. You may return the answer in any order.
Example 1:
Input: s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT" Output: ["AAAAACCCCC","CCCCCAAAAA"]
Example 2:
Input: s = "AAAAAAAAAAAAA" Output: ["AAAAAAAAAA"]
Constraints:
1 <= s.length <= 105s[i] is either 'A', 'C', 'G', or 'T'.Problem Overview: Given a DNA string containing characters A, C, G, and T, find every 10-letter-long sequence that appears more than once. The challenge is efficiently detecting repeated substrings of fixed length without comparing every possible pair.
Approach 1: Sliding Window with Hash Map (O(n) time, O(n) space)
Use a sliding window of length 10 and scan the string from left to right. For each index, extract the substring s[i:i+10] and store it in a hash table. The hash map tracks how many times each sequence appears. When a sequence’s count becomes 2, add it to the result list. Each substring lookup and update is an average O(1) hash operation, so the overall scan stays linear.
This approach is straightforward and easy to implement in most languages. The key insight is that the window size is fixed, so each position generates exactly one candidate substring. Hashing avoids expensive pairwise comparisons between sequences. The downside is memory usage: storing many 10-character strings increases space consumption, especially when the DNA string is large.
Approach 2: Bit Manipulation with Rolling Encoding (O(n) time, O(n) space)
DNA characters come from a set of four symbols, which means each can be encoded using 2 bits. Map A=00, C=01, G=10, and T=11. A 10-character sequence therefore fits into a 20-bit integer. Instead of storing substrings, maintain a rolling integer hash. Shift the current value left by 2 bits, add the new character’s encoding, and mask the result to keep only the latest 20 bits.
This technique acts like a compact bit manipulation rolling hash. Store each encoded sequence in a hash set to track whether it has appeared before. When the same 20-bit pattern appears again, record the corresponding substring as a repeated sequence. The rolling update makes each step O(1) while reducing memory compared to storing full strings.
Recommended for interviews: Start by explaining the sliding window with a hash map. It clearly demonstrates understanding of substring scanning and hash-based duplicate detection. After that, propose the bit manipulation optimization. Interviewers often look for this encoding trick because it reduces memory and demonstrates knowledge of rolling hashes and compact representations.
This approach utilizes a sliding window technique combined with a hash map (or dictionary) to identify 10-letter-long sequences that appear more than once. You slide the window across the string and keep track of all sequences in a hash map. If a sequence is found more than once, it's added to the result list.
This Python solution uses a set to track seen sequences and another set to track repeated sequences. It iterates over the string and extracts each 10-letter sequence using slicing. If a sequence has been seen before, it's added to the 'repeated' set.
Python
JavaScript
Time Complexity: O(n), where n is the length of the string, due to the single pass over the string.
Space Complexity: O(n), as we store sequences in sets.
This approach utilizes bit manipulation to represent sequences as numbers, which can help optimize space usage when compared to storing strings directly. Each nucleotide is encoded as 2 bits ('A' = 00, 'C' = 01, 'G' = 10, 'T' = 11), allowing the entire 10-letter sequence to fit within 20 bits.
This C solution uses bit manipulation to encode 10-letter sequences as 20-bit integers. It uses arrays to track sequences seen and repeated. This allows for efficient membership checking and space usage.
Time Complexity: O(n), as each character in the string is processed once.
Space Complexity: O(4^10), due to fixed encoding size for sequences.
We define a hash table cnt to store the occurrence count of all substrings of length 10.
We iterate through all substrings of length 10 in the string s. For the current substring t, we update its count in the hash table. If the count of t is 2, we add it to the answer.
After the iteration, we return the answer array.
The time complexity is O(n times 10), and the space complexity is O(n times 10). Here, n is the length of the string s.
Python
Java
C++
Go
TypeScript
Rust
JavaScript
C#
This method essentially combines sliding window and hash. Similar to 0028. Find the Index of the First Occurrence in a String, this problem can use a hash function to reduce the time complexity of counting subsequences to O(1).
The time complexity is O(n), and the space complexity is O(n). Here, n is the length of the string s.
Go
| Approach | Complexity |
|---|---|
| Sliding Window with Hash Map | Time Complexity: O(n), where n is the length of the string, due to the single pass over the string. |
| Bit Manipulation | Time Complexity: O(n), as each character in the string is processed once. |
| Hash Table | — |
| Rabin-Karp String Matching Algorithm | — |
| Approach | Time | Space | When to Use |
|---|---|---|---|
| Sliding Window with Hash Map | O(n) | O(n) | Best for readability and quick implementation during interviews |
| Bit Manipulation Rolling Hash | O(n) | O(n) | When optimizing memory and demonstrating advanced hashing techniques |
Repeated DNA Sequences - Leetcode 187 - Python • NeetCode • 43,860 views views
Watch 9 more video solutions →Practice Repeated DNA Sequences with our built-in code editor and test cases.
Practice on FleetCode