#187 Repeated DNA Sequences - Solution

The DNA sequence is composed of a series of nucleotides abbreviated as 'A', 'C', 'G', and 'T'.

For example, "ACGAATTCCG" is a DNA sequence.

When studying DNA, it is useful to identify repeated sequences within the DNA.

Given a string s that represents a DNA sequence, return all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule. You may return the answer in any order.

Example 1:

Input: s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"
Output: ["AAAAACCCCC","CCCCCAAAAA"]

Example 2:

Input: s = "AAAAAAAAAAAAA"
Output: ["AAAAAAAAAA"]

Constraints:

1 <= s.length <= 10⁵
s[i] is either 'A', 'C', 'G', or 'T'.

The DNA sequence is composed of a series of nucleotides abbreviated as 'A', 'C', 'G', and 'T'.

For example, "ACGAATTCCG" is a DNA sequence.

When studying DNA, it is useful to identify repeated sequences within the DNA.

Example 1:

Input: s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"
Output: ["AAAAACCCCC","CCCCCAAAAA"]

Example 2:

Input: s = "AAAAAAAAAAAAA"
Output: ["AAAAAAAAAA"]

Constraints:

1 <= s.length <= 10⁵
s[i] is either 'A', 'C', 'G', or 'T'.

Approach	Time Complexity	Space Complexity
Sliding Window with Hash Set	O(n)	O(n)
Rolling Hash / Bit Manipulation Encoding	O(n)	O(n)

Solutions (4)

Sliding Window with Hash Map

This approach utilizes a sliding window technique combined with a hash map (or dictionary) to identify 10-letter-long sequences that appear more than once. You slide the window across the string and keep track of all sequences in a hash map. If a sequence is found more than once, it's added to the result list.

Time Complexity: O(n), where n is the length of the string, due to the single pass over the string.
Space Complexity: O(n), as we store sequences in sets.

Python JavaScript

1function findRepeatedDnaSequences(s) {
2    const seen = new Set();
3    const repeated = new Set();
4    for (let i = 0; i <= s.length - 10; i++) {
5        const seq = s.substring(i, i + 10);
6        if (seen.has(seq)) {
7            repeated.add(seq);
8        } else {
9            seen.add(seq);
10        }
11    }
12    return Array.from(repeated);
13}

Explanation

This JavaScript solution is similar to the Python approach, using JavaScript's Set to store seen and repeated sequences. It iterates through the string, adds new sequences to the 'seen' set, and tracks sequences that repeat.

Bit Manipulation

This approach utilizes bit manipulation to represent sequences as numbers, which can help optimize space usage when compared to storing strings directly. Each nucleotide is encoded as 2 bits ('A' = 00, 'C' = 01, 'G' = 10, 'T' = 11), allowing the entire 10-letter sequence to fit within 20 bits.

Time Complexity: O(n), as each character in the string is processed once.
Space Complexity: O(4^10), due to fixed encoding size for sequences.

C C++

1#include <vector>
2#include <string>
3#include <unordered_set>
4
5std::vector<std::string> findRepeatedDnaSequences(std::string s) {
    std::unordered_set<int> seen, repeated;
    std::vector<std::string> result;
    int map[128] = {0};
    map['A'] = 0;
    map['C'] = 1;
    map['G'] = 2;
    map['T'] = 3;

    int num = 0;
    for (int i = 0; i < s.size(); ++i) {
        num = ((num << 2) & 0xFFFFF) | map[s[i]];
        if (i < 9) continue;
        if (seen.count(num) && !repeated.count(num)) {
            result.push_back(s.substr(i - 9, 10));
            repeated.insert(num);
        }
        seen.insert(num);
    }
    return result;
}