Given a list paths of directory info, including the directory path, and all the files with contents in this directory, return all the duplicate files in the file system in terms of their paths. You may return the answer in any order.
A group of duplicate files consists of at least two files that have the same content.
A single directory info string in the input list has the following format:
"root/d1/d2/.../dm f1.txt(f1_content) f2.txt(f2_content) ... fn.txt(fn_content)"It means there are n files (f1.txt, f2.txt ... fn.txt) with content (f1_content, f2_content ... fn_content) respectively in the directory "root/d1/d2/.../dm". Note that n >= 1 and m >= 0. If m = 0, it means the directory is just the root directory.
The output is a list of groups of duplicate file paths. For each group, it contains all the file paths of the files that have the same content. A file path is a string that has the following format:
"directory_path/file_name.txt"Example 1:
Input: paths = ["root/a 1.txt(abcd) 2.txt(efgh)","root/c 3.txt(abcd)","root/c/d 4.txt(efgh)","root 4.txt(efgh)"] Output: [["root/a/2.txt","root/c/d/4.txt","root/4.txt"],["root/a/1.txt","root/c/3.txt"]]
Example 2:
Input: paths = ["root/a 1.txt(abcd) 2.txt(efgh)","root/c 3.txt(abcd)","root/c/d 4.txt(efgh)"] Output: [["root/a/2.txt","root/c/d/4.txt"],["root/a/1.txt","root/c/3.txt"]]
Constraints:
1 <= paths.length <= 2 * 10^4
1 <= paths[i].length <= 3000
1 <= sum(paths[i].length) <= 5 * 10^5
paths[i] consist of English letters, digits, '/', '.', '(', ')', and ' '.
In #609 Find Duplicate File in System, each directory string contains file names and their contents in the format filename(content). The key idea is to identify files that share the same content. Instead of comparing every file with every other file (which would be inefficient), we can group files by their content.
A practical approach is to use a hash table where the key is the file content and the value is a list of full file paths containing that content. Parse each directory string, extract the directory path, file names, and their contents, and build the full file path for each file. Insert the path into the list mapped to its content.
After processing all files, iterate through the hash table and collect groups whose list size is greater than one. These groups represent duplicate files. The approach runs in roughly O(N × L) time where N is the number of files and L is the average string length, with additional space used for the hash map.
| Approach | Time Complexity | Space Complexity |
|---|---|---|
| Hash Map Grouping by File Content | O(N × L) | O(N × L) |
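The grouping described above can be sketched in Python as follows (the function name `group_by_content` is illustrative, not part of the original):

```python
from collections import defaultdict

def group_by_content(paths):
    # Map each file content to every full path that contains it.
    content_to_paths = defaultdict(list)
    for entry in paths:
        directory, *files = entry.split(" ")
        for file_info in files:
            # Split "name(content)" into the file name and its content.
            name, _, content = file_info.partition("(")
            content_to_paths[content.rstrip(")")].append(directory + "/" + name)
    # Duplicates are contents shared by at least two paths.
    return [group for group in content_to_paths.values() if len(group) > 1]
```

Each file is visited once, so the sketch matches the O(N × L) bound in the table.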
To find duplicate files by their content, we can use a HashMap (or Dictionary). For each directory info string, parse the directory path and the files with their contents. Use the content as the key in the map and append the full path of each file to the list stored under that key. After parsing all inputs, the keys mapped to more than one path represent duplicate files.
Time Complexity: O(n), where n is the total number of characters in all file paths. We iterate over each character once.
Space Complexity: O(n) to store the lists of file paths in the dictionary.
```java
import java.util.*;

public class DuplicateFiles {
    public List<List<String>> findDuplicate(String[] paths) {
        // Map each file content to the list of full paths that contain it.
        Map<String, List<String>> contentToPaths = new HashMap<>();
        for (String entry : paths) {
            String[] parts = entry.split(" ");
            String dir = parts[0];
            for (int i = 1; i < parts.length; i++) {
                int open = parts[i].indexOf('(');
                String name = parts[i].substring(0, open);
                String content = parts[i].substring(open + 1, parts[i].length() - 1);
                contentToPaths.computeIfAbsent(content, k -> new ArrayList<>())
                              .add(dir + "/" + name);
            }
        }
        // Keep only the groups with at least two paths.
        List<List<String>> result = new ArrayList<>();
        for (List<String> group : contentToPaths.values()) {
            if (group.size() > 1) {
                result.add(group);
            }
        }
        return result;
    }
}
```

In Java, we use a HashMap to collect file paths by their content. Using the split method, we separate the root directory from the files. In a loop, each file name and its content are isolated, letting us map the full path of each file under its content as the key. Finally, groups with more than one entry are added to the result list and returned.
This approach processes the string data directly and arranges results in a 2D array. The strings are manipulated to extract the directory path, file names, and contents into standalone variables, and each full path is then appended to a growing structure. Compared to the hash map approach, this method uses parallel arrays to aggregate identical files.
Time Complexity: O(n), where n is the total input character count due to one-pass evaluation.
Space Complexity: O(n), maintaining paths and intermediate arrays.
```python
def find_duplicate_with_arrays(paths):
    contents = []  # unique contents seen so far
    groups = []    # groups[i] holds the paths whose content equals contents[i]
    for entry in paths:
        parts = entry.split(" ")
        directory = parts[0]
        for file_info in parts[1:]:
            open_idx = file_info.index("(")
            name = file_info[:open_idx]           # slice out the file name
            content = file_info[open_idx + 1:-1]  # slice out the content
            full_path = directory + "/" + name
            if content in contents:
                groups[contents.index(content)].append(full_path)
            else:
                contents.append(content)
                groups.append([full_path])
    # Only groups with at least two paths are duplicates.
    return [group for group in groups if len(group) > 1]
```
Yes, variations of this problem can appear in technical interviews at large tech companies. It tests string parsing, hash map usage, and the ability to design efficient grouping strategies for large datasets.
A hash table (or dictionary) is the best data structure for this problem. It allows efficient grouping of file paths based on identical file content and provides constant-time average insertion and lookup.
The optimal approach uses a hash map to group file paths by their content. As you parse each directory string, store the file content as the key and append the full file path to the corresponding list. Any group with more than one path represents duplicate files.
Duplicate files are defined by having the same content, not necessarily the same name. Two files may have different names but identical content, so grouping by content ensures all duplicates are correctly detected.
By substituting hash maps with straightforward string operations and arrays in Python, this solution mimics hash map behavior to achieve equivalent outcomes. It uses slicing to extract filenames and contents and builds a result by tracking paths directly through array indexing.
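The slicing step described above can be shown in isolation (`split_token` is a hypothetical helper name, not from the original):

```python
def split_token(token):
    # Slice a "name(content)" token into its name and content parts.
    open_idx = token.index("(")
    return token[:open_idx], token[open_idx + 1:-1]
```

For example, `split_token("1.txt(abcd)")` yields the pair `("1.txt", "abcd")`.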