The DNA sequence is composed of a series of nucleotides abbreviated as 'A', 'C', 'G', and 'T'.
"ACGAATTCCG" is a DNA sequence.When studying DNA, it is useful to identify repeated sequences within the DNA.
Given a string s that represents a DNA sequence, return all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule. You may return the answer in any order.
Example 1:
Input: s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT" Output: ["AAAAACCCCC","CCCCCAAAAA"]
Example 2:
Input: s = "AAAAAAAAAAAAA" Output: ["AAAAAAAAAA"]
Constraints:
1 <= s.length <= 105s[i] is either 'A', 'C', 'G', or 'T'.The key challenge in Repeated DNA Sequences is efficiently identifying all 10-letter-long substrings that appear more than once in a DNA string. Since the substring length is fixed, a sliding window approach works well. As the window moves across the string, each 10-character substring can be tracked using a hash table or set to record which sequences have already been seen.
A straightforward strategy stores each substring in a hash set and records duplicates when encountered again. However, this can be optimized using rolling hash or bit manipulation. Because DNA characters (A, C, G, T) can be encoded with just two bits, the window can be converted into a compact integer representation, allowing faster comparisons and reduced memory overhead.
The algorithm scans the string once while maintaining the rolling representation of the current window. This leads to an efficient solution with near-linear performance. Time complexity is typically O(n), while space complexity depends on how many unique sequences are stored.
| Approach | Time Complexity | Space Complexity |
|---|---|---|
| Sliding Window with Hash Set | O(n) | O(n) |
| Rolling Hash / Bit Manipulation Encoding | O(n) | O(n) |
NeetCode
This approach utilizes a sliding window technique combined with a hash map (or dictionary) to identify 10-letter-long sequences that appear more than once. You slide the window across the string and keep track of all sequences in a hash map. If a sequence is found more than once, it's added to the result list.
Time Complexity: O(n), where n is the length of the string, due to the single pass over the string.
Space Complexity: O(n), as we store sequences in sets.
1function findRepeatedDnaSequences(s) {
2 const seen = new Set();
3 const repeated = new Set();
4 for (let i = 0; i <= s.length - 10; i++) {
5 const seq = s.substring(i, i + 10);
6 if (seen.has(seq)) {
7 repeated.add(seq);
8 } else {
9 seen.add(seq);
10 }
11 }
12 return Array.from(repeated);
13}This JavaScript solution is similar to the Python approach, using JavaScript's Set to store seen and repeated sequences. It iterates through the string, adds new sequences to the 'seen' set, and tracks sequences that repeat.
This approach utilizes bit manipulation to represent sequences as numbers, which can help optimize space usage when compared to storing strings directly. Each nucleotide is encoded as 2 bits ('A' = 00, 'C' = 01, 'G' = 10, 'T' = 11), allowing the entire 10-letter sequence to fit within 20 bits.
Time Complexity: O(n), as each character in the string is processed once.
Space Complexity: O(4^10), due to fixed encoding size for sequences.
1#include <vector>
2#include <string>
3#include <unordered_set>
4
5std::vector<std::string> findRepeatedDnaSequences(std::string s) {
std::unordered_set<int> seen, repeated;
std::vector<std::string> result;
int map[128] = {0};
map['A'] = 0;
map['C'] = 1;
map['G'] = 2;
map['T'] = 3;
int num = 0;
for (int i = 0; i < s.size(); ++i) {
num = ((num << 2) & 0xFFFFF) | map[s[i]];
if (i < 9) continue;
if (seen.count(num) && !repeated.count(num)) {
result.push_back(s.substr(i - 9, 10));
repeated.insert(num);
}
seen.insert(num);
}
return result;
}Watch expert explanations and walkthroughs
Practice problems asked by these companies to ace your technical interviews.
Explore More ProblemsJot down your thoughts, approach, and key learnings
Yes, variations of this problem appear in coding interviews at large tech companies. It tests knowledge of hashing, sliding window techniques, and optimization using bit manipulation.
Rolling hash allows the algorithm to update the hash of the current 10-character window in constant time as the window slides. This avoids repeatedly computing hashes for entire substrings and improves efficiency.
The optimal approach uses a sliding window combined with hashing or rolling hash. By encoding each 10-letter substring and tracking previously seen sequences, duplicates can be detected efficiently in linear time.
A hash set or hash map is commonly used to store seen DNA substrings. In optimized solutions, bit manipulation encodes each character into two bits, enabling compact storage and faster lookups.
This C++ solution uses unordered sets to track sequences encoded as numbers. Each 10-letter sequence is converted into an integer using bit shifting and masking to ensure only 20 bits are used, allowing for efficient comparison and storage.