#609 Find Duplicate File in System - Solution

Given a list paths of directory info, including the directory path, and all the files with contents in this directory, return all the duplicate files in the file system in terms of their paths. You may return the answer in any order.

A group of duplicate files consists of at least two files that have the same content.

A single directory info string in the input list has the following format:

"root/d1/d2/.../dm f1.txt(f1_content) f2.txt(f2_content) ... fn.txt(fn_content)"

It means there are n files (f1.txt, f2.txt ... fn.txt) with content (f1_content, f2_content ... fn_content) respectively in the directory "root/d1/d2/.../dm". Note that n >= 1 and m >= 0. If m = 0, it means the directory is just the root directory.

The output is a list of groups of duplicate file paths. For each group, it contains all the file paths of the files that have the same content. A file path is a string that has the following format:

"directory_path/file_name.txt"

Example 1:

Input: paths = ["root/a 1.txt(abcd) 2.txt(efgh)","root/c 3.txt(abcd)","root/c/d 4.txt(efgh)","root 4.txt(efgh)"]
Output: [["root/a/2.txt","root/c/d/4.txt","root/4.txt"],["root/a/1.txt","root/c/3.txt"]]

Example 2:

Input: paths = ["root/a 1.txt(abcd) 2.txt(efgh)","root/c 3.txt(abcd)","root/c/d 4.txt(efgh)"]
Output: [["root/a/2.txt","root/c/d/4.txt"],["root/a/1.txt","root/c/3.txt"]]

Constraints:

1 <= paths.length <= 2 * 10⁴
1 <= paths[i].length <= 3000
1 <= sum(paths[i].length) <= 5 * 10⁵
paths[i] consist of English letters, digits, '/', '.', '(', ')', and ' '.
You may assume no files or directories share the same name in the same directory.
You may assume each given directory info represents a unique directory. A single blank space separates the directory path and file info.

Follow up:

Imagine you are given a real file system, how will you search files? DFS or BFS?
If the file content is very large (GB level), how will you modify your solution?
If you can only read the file by 1kb each time, how will you modify your solution?
What is the time complexity of your modified solution? What is the most time-consuming part and memory-consuming part of it? How to optimize?
How to make sure the duplicated files you find are not false positive?

In #609 Find Duplicate File in System, each directory string contains file names and their contents in the format filename(content). The key idea is to identify files that share the same content. Instead of comparing every file with every other file (which would be inefficient), we can group files by their content.

A practical approach is to use a hash table where the key is the file content and the value is a list of full file paths containing that content. Parse each directory string, extract the directory path, file names, and their contents, and build the full file path for each file. Insert the path into the list mapped to its content.

After processing all files, iterate through the hash table and collect groups whose list size is greater than one. These groups represent duplicate files. The approach runs in roughly O(N × L) time where N is the number of files and L is the average string length, with additional space used for the hash map.

Approach	Time Complexity	Space Complexity
Hash Map Grouping by File Content	O(N × L)	O(N × L)

Solutions (10)

Using a HashMap or Dictionary

To find duplicate files by their content, we can use a HashMap (or Dictionary). For each directory info string, parse the directory path and the files with their contents. Use the content as the key in the map and store full path of the file as its value. After parsing all inputs, the map keys with more than one value represent duplicate files.

Time Complexity: O(n), where n is the total number of characters in all file paths. We iterate over each character once.
Space Complexity: O(n) to store the lists of file paths in the dictionary.

Python C++Java JavaScript C C#

1using System;
2using System.Collections.Generic;
3
4public class DuplicateFiles
5{
6    public IList<IList<string>> FindDuplicate(string[] paths)
7    {
8        var contentMap = new Dictionary<string, List<string>>();
9        foreach (var path in paths)
10        {
11            var parts = path.Split(' ');
12            var root = parts[0];
13            for (int i = 1; i < parts.Length; ++i)
14            {
                var nameContent = parts[i].Split('(');
                var name = nameContent[0];
                var content = nameContent[1].Substring(0, nameContent[1].Length - 1);
                if (!contentMap.ContainsKey(content))
                {
                    contentMap[content] = new List<string>();
                }
                contentMap[content].Add(root + "/" + name);
            }
        }
        var result = new List<IList<string>>();
        foreach (var group in contentMap.Values)
        {
            if (group.Count > 1)
            {
                result.Add(group);
            }
        }
        return result;
    }
}

Explanation

In C#, the Dictionary class facilitates mapping file content to its corresponding paths. The split method aids in processing directories and files, allowing us to accurately extract and group paths by content key. The final step involves filtering and returning only those entries which have duplicate file path lists.

Using Direct String Manipulation and 2D Array

This approach is based on directly processing string data and arranging results using a 2D array. Strings are manipulated to extract directory data, file names, and contents into standalone variables, then append paths to a growing structure. Compared to hash maps, this method uses arrays to aggregate identical files.

Time Complexity: O(n), where n is the total input character count due to one-pass evaluation.
Space Complexity: O(n), maintaining paths and intermediate arrays.

Python C++Java C#