Given a list paths of directory info, including the directory path, and all the files with contents in this directory, return all the duplicate files in the file system in terms of their paths. You may return the answer in any order.
A group of duplicate files consists of at least two files that have the same content.
A single directory info string in the input list has the following format:
"root/d1/d2/.../dm f1.txt(f1_content) f2.txt(f2_content) ... fn.txt(fn_content)"It means there are n files (f1.txt, f2.txt ... fn.txt) with content (f1_content, f2_content ... fn_content) respectively in the directory "root/d1/d2/.../dm". Note that n >= 1 and m >= 0. If m = 0, it means the directory is just the root directory.
The output is a list of groups of duplicate file paths. For each group, it contains all the file paths of the files that have the same content. A file path is a string that has the following format:
"directory_path/file_name.txt"
Example 1:
Input: paths = ["root/a 1.txt(abcd) 2.txt(efgh)","root/c 3.txt(abcd)","root/c/d 4.txt(efgh)","root 4.txt(efgh)"] Output: [["root/a/2.txt","root/c/d/4.txt","root/4.txt"],["root/a/1.txt","root/c/3.txt"]]
Example 2:
Input: paths = ["root/a 1.txt(abcd) 2.txt(efgh)","root/c 3.txt(abcd)","root/c/d 4.txt(efgh)"] Output: [["root/a/2.txt","root/c/d/4.txt"],["root/a/1.txt","root/c/3.txt"]]
Constraints:
1 <= paths.length <= 2 * 10^4
1 <= paths[i].length <= 3000
1 <= sum(paths[i].length) <= 5 * 10^5
paths[i] consist of English letters, digits, '/', '.', '(', ')', and ' '.
Follow up:
Imagine you are given a real file system. Would you search for duplicates with DFS or BFS?
If the file content is very large (GB level), how would you modify your solution?
If you can only read the file 1 KB at a time, how would you modify your solution?
What is the time complexity of the modified solution? Which parts are the most time-consuming and memory-consuming, and how could you optimize them?
How can you ensure the duplicated files you find are not false positives?
Problem Overview: Each input string represents a directory followed by multiple files and their contents, formatted like "root/a 1.txt(abcd) 2.txt(efgh)". You need to identify files with identical content and return their full paths grouped together. Only groups with more than one file count as duplicates.
Approach 1: Using a HashMap / Dictionary (O(n * L) time, O(n * L) space)
This approach groups files by their content using a hash map. Iterate through each directory string, split it by spaces to separate the directory path from its files, and parse each file entry to extract the file name and the content inside parentheses. Build the full path using the directory and file name, then insert it into a map where the key is the file content and the value is a list of file paths. Because hash lookups are O(1) on average, grouping happens efficiently while scanning the input once.
After processing all entries, iterate through the map and collect only the lists whose size is greater than one. Those represent files that share identical content. The main cost comes from parsing strings and storing them, giving roughly O(n * L) time where L is the average length of each path string. This method relies heavily on fast hashing and string processing, making it a natural application of a hash table combined with careful string parsing.
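To make the parsing step concrete, here is a minimal sketch; the helper name parse_entry is our own and not from the original statement:

```python
def parse_entry(entry: str) -> list[tuple[str, str]]:
    """Split one directory-info string into (full_path, content) pairs."""
    directory, *files = entry.split(" ")
    pairs = []
    for f in files:
        # "1.txt(abcd)" -> name "1.txt", content "abcd"
        name, _, rest = f.partition("(")
        content = rest[:-1]          # drop the trailing ")"
        pairs.append((f"{directory}/{name}", content))
    return pairs

# parse_entry("root/a 1.txt(abcd) 2.txt(efgh)")
# -> [("root/a/1.txt", "abcd"), ("root/a/2.txt", "efgh")]
```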
Approach 2: Direct String Manipulation with 2D Array Comparison (O(n² * L) time, O(n * L) space)
Another way is to first parse all directory strings and store each file as a pair: [fullPath, content] inside a 2D array. Once every file entry is extracted, compare each file's content with every other file using nested loops. When two files share identical content, place their paths into the same duplicate group.
This approach uses only arrays and direct comparisons instead of hashing. The downside is the quadratic comparison step: for n files, the algorithm performs up to n² content checks. Each comparison may scan the entire string content, making the total cost roughly O(n² * L). While simpler conceptually, it becomes slow for large datasets. It mainly demonstrates how duplicate detection works without additional data structures.
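A minimal sketch of this brute-force variant, reusing the hypothetical parse_entry helper defined above:

```python
def find_duplicates_bruteforce(paths: list[str]) -> list[list[str]]:
    # Flatten every directory string into (full_path, content) rows.
    files = [pair for entry in paths for pair in parse_entry(entry)]
    n = len(files)
    grouped = [False] * n            # marks files already placed in a group
    result = []
    for i in range(n):
        if grouped[i]:
            continue
        group = [files[i][0]]
        for j in range(i + 1, n):
            # O(L) string comparison for every pair -> O(n^2 * L) overall.
            if not grouped[j] and files[j][1] == files[i][1]:
                group.append(files[j][0])
                grouped[j] = True
        if len(group) > 1:           # only keep genuine duplicate groups
            result.append(group)
    return result
```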
Recommended for interviews: The hash map grouping approach is the expected solution. It demonstrates efficient parsing, use of a dictionary for grouping, and linear scanning of the input. Interviewers want to see that you immediately map file content to paths instead of comparing every pair. The brute-force comparison still helps show baseline reasoning, but the hash-based grouping proves you understand how to reduce quadratic work to near-linear time.
To find duplicate files by their content, we can use a HashMap (or Dictionary). For each directory info string, parse the directory path and the files with their contents. Use the content as the key in the map and append the full path of each file to its value list. After parsing all inputs, the map keys with more than one value represent duplicate files.
This Python solution uses defaultdict from the collections module to store lists of file paths with the same content. We split each path string into directory and file information, then iterate over each file's name and content to record them in our dictionary. Finally, we filter the dictionary to include only entries that have more than one file path, returning them as groups of duplicates.
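A minimal sketch matching that description (the function name find_duplicate is our choice):

```python
from collections import defaultdict

def find_duplicate(paths: list[str]) -> list[list[str]]:
    groups = defaultdict(list)                 # content -> list of full paths
    for entry in paths:
        directory, *files = entry.split(" ")
        for f in files:
            name, _, rest = f.partition("(")   # "1.txt(abcd)" -> "1.txt", "abcd)"
            content = rest[:-1]                # strip the trailing ")"
            groups[content].append(f"{directory}/{name}")
    # Keep only contents shared by at least two files.
    return [g for g in groups.values() if len(g) > 1]
```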
Time Complexity: O(n), where n is the total number of characters in all file paths. We iterate over each character once.
Space Complexity: O(n) to store the lists of file paths in the dictionary.
This approach processes the string data directly and arranges results in a 2D array. Strings are sliced to extract the directory path, file names, and contents into standalone variables, and each full path is appended to a growing structure. Instead of a hash map, arrays aggregate identical files.
By substituting straightforward string operations and arrays for hash maps, this Python solution mimics hash-map behavior to reach the same result: one array tracks each distinct content, and a parallel array of lists collects the paths that share it, with new contents located by a linear scan rather than a hash lookup.
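A minimal sketch of this array-only variant, assuming a linear scan stands in for the hash lookup:

```python
def find_duplicate_arrays(paths: list[str]) -> list[list[str]]:
    contents: list[str] = []        # distinct contents seen so far, in order
    groups: list[list[str]] = []    # groups[i] holds paths whose content == contents[i]
    for entry in paths:
        directory, *files = entry.split(" ")
        for f in files:
            name, _, rest = f.partition("(")
            content = rest[:-1]
            full_path = f"{directory}/{name}"
            if content in contents:            # linear search, no hashing
                groups[contents.index(content)].append(full_path)
            else:
                contents.append(content)
                groups.append([full_path])
    return [g for g in groups if len(g) > 1]
```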
Time Complexity: O(k² * L) in the worst case, where k is the number of files and L is the average content length, because each new file's content may be checked against every distinct content seen so far.
Space Complexity: O(n), where n is the total input character count, for the stored paths and intermediate arrays.
We create a hash table d, where the key is the file content and the value is a list of file paths with the same content.
Next, we iterate over paths. For each path, we split it into the directory path and file information. For each file entry, we extract the file name and file content, and append the file path to the corresponding list in hash table d.
Finally, we return all values in hash table d that have more than one file path.
The time complexity is O(n) and the space complexity is O(n), where n is the total length of all strings in paths.
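As a quick sanity check, running the hash-table grouping from the find_duplicate sketch above on Example 1:

```python
paths = ["root/a 1.txt(abcd) 2.txt(efgh)", "root/c 3.txt(abcd)",
         "root/c/d 4.txt(efgh)", "root 4.txt(efgh)"]
print(find_duplicate(paths))
# [['root/a/1.txt', 'root/c/3.txt'],
#  ['root/a/2.txt', 'root/c/d/4.txt', 'root/4.txt']]
# (groups may be returned in any order)
```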
| Approach | Time | Space | When to Use |
|---|---|---|---|
| HashMap / Dictionary Grouping | O(n * L) | O(n * L) | General case; fastest way to group files by identical content |
| Direct String Comparison with 2D Array | O(n² * L) | O(n * L) | Educational baseline when avoiding hash tables or demonstrating brute-force grouping |