DataFrame customers

+-------------+--------+
| Column Name | Type   |
+-------------+--------+
| customer_id | int    |
| name        | object |
| email       | object |
+-------------+--------+
There are some duplicate rows in the DataFrame based on the email column.
Write a solution to remove these duplicate rows and keep only the first occurrence.
The result format is in the following example.
Example 1:

Input:
+-------------+---------+---------------------+
| customer_id | name    | email               |
+-------------+---------+---------------------+
| 1           | Ella    | emily@example.com   |
| 2           | David   | michael@example.com |
| 3           | Zachary | sarah@example.com   |
| 4           | Alice   | john@example.com    |
| 5           | Finn    | john@example.com    |
| 6           | Violet  | alice@example.com   |
+-------------+---------+---------------------+

Output:
+-------------+---------+---------------------+
| customer_id | name    | email               |
+-------------+---------+---------------------+
| 1           | Ella    | emily@example.com   |
| 2           | David   | michael@example.com |
| 3           | Zachary | sarah@example.com   |
| 4           | Alice   | john@example.com    |
| 6           | Violet  | alice@example.com   |
+-------------+---------+---------------------+

Explanation: Alice (customer_id = 4) and Finn (customer_id = 5) both use john@example.com, so only the first occurrence of this email is retained.
Problem Overview: You are given a table-like dataset and need to remove rows whose email duplicates that of an earlier row, keeping only the first occurrence of each email. The result should preserve the original order of appearance. The same techniques generalize to deduplicating on entire rows or any subset of columns.
Approach 1: Using Built-in Functions (O(n) time, O(n) space)
The most direct solution uses the built-in drop_duplicates() function available in pandas. This method scans the DataFrame and internally tracks previously seen rows using hashing. When a duplicate row appears, it is skipped while the first instance is preserved. Under the hood, pandas performs efficient hash lookups to determine whether a row has already been encountered, giving an overall O(n) time complexity for n rows and O(n) auxiliary space to store hash keys.
This approach is concise and optimized because the library handles hashing, row comparison, and indexing internally. In production data pipelines or analytics workflows, this is the preferred solution since it is reliable and highly optimized in C-backed pandas operations. If you're working with tabular data manipulation or DataFrame operations, built-in deduplication functions are usually the best option.
Approach 2: Custom Duplicate Check (O(n) time, O(n) space)
A manual approach iterates through each row and stores a unique representation of the row inside a hash set. Each row can be converted to a tuple so it becomes hashable. During iteration, check whether the tuple already exists in the set. If not, add it to the set and keep the row in the result; otherwise skip it because it is a duplicate.
This solution uses a hash table to guarantee constant-time membership checks on average. Each row is processed exactly once, resulting in O(n) time complexity and O(n) space complexity for storing previously seen rows. While this approach is more verbose than the built-in function, it demonstrates the core deduplication logic and works even when built-in helpers are unavailable.
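The row-tuple idea described above can be sketched as follows. This is an illustrative helper (the function name `drop_duplicate_rows` is ours), deduplicating on entire rows rather than a single column:

```python
import pandas as pd

def drop_duplicate_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Keep the first occurrence of each fully-duplicated row."""
    seen = set()
    keep = []  # index labels of rows to retain
    for idx, row in df.iterrows():
        key = tuple(row)  # hashable snapshot of the row's values
        if key not in seen:
            seen.add(key)
            keep.append(idx)
    return df.loc[keep]
```

Each row is visited once and set membership is O(1) on average, matching the stated O(n) time and O(n) space bounds.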
Recommended for interviews: The built-in drop_duplicates() approach is typically expected when working in a pandas environment because it reflects real-world data processing practices. However, interviewers sometimes ask for the underlying logic. Implementing the custom hash-set method shows you understand how duplicate detection works internally. Starting with the manual approach and then mentioning the optimized built-in API demonstrates both algorithmic thinking and practical engineering experience.
This approach utilizes built-in library functions designed to handle duplicates, providing a clean and efficient solution.
This Python solution uses the `drop_duplicates` method from the Pandas library, which is specifically designed to remove duplicate entries from a DataFrame based on certain criteria. Here, we specify the `email` column as the basis for detecting duplicates and keep the first occurrence among any found duplicates.
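A minimal sketch of this solution (the function name `drop_duplicate_emails` is ours; the LeetCode template may name it differently):

```python
import pandas as pd

def drop_duplicate_emails(customers: pd.DataFrame) -> pd.DataFrame:
    # keep='first' (the default) retains the earliest row for each email;
    # subset=['email'] restricts duplicate detection to that column.
    return customers.drop_duplicates(subset=['email'], keep='first')
```

Applied to the example input, the row with customer_id = 5 is dropped because its email was already seen at customer_id = 4.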
Time Complexity: O(n), where n is the number of entries in the DataFrame. The operation scans the DataFrame once to identify duplicates.
Space Complexity: O(n). `drop_duplicates` hashes the values it has seen to detect repeats, and it returns a new DataFrame rather than modifying the original in place.
In this approach, we manually check for duplicates using a data structure to track seen entries, allowing us to handle duplicates without relying on specific library functions.
This Python solution uses a set to track seen email addresses while iterating through the DataFrame rows. We keep a list of unique rows, appending only those with unseen emails, thereby manually handling the de-duplication.
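A sketch of this manual approach (the function name `drop_duplicate_emails_manual` is ours):

```python
import pandas as pd

def drop_duplicate_emails_manual(customers: pd.DataFrame) -> pd.DataFrame:
    seen_emails = set()   # emails encountered so far
    unique_rows = []      # rows whose email appeared for the first time
    for _, row in customers.iterrows():
        if row['email'] not in seen_emails:
            seen_emails.add(row['email'])
            unique_rows.append(row)
    return pd.DataFrame(unique_rows)
```

Note that `iterrows()` is much slower than the vectorized `drop_duplicates` on large DataFrames, which is why the built-in is preferred in practice.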
Time Complexity: O(n), where n is the number of entries in the DataFrame. Each entry is processed exactly once.
Space Complexity: O(n), due to the storage needed for the list of unique rows and the set of seen emails.
| Approach | Time | Space | When to Use |
|---|---|---|---|
| Using Built-in Functions (pandas drop_duplicates) | O(n) | O(n) | Best for real-world pandas workflows and concise solutions |
| Custom Duplicate Check with Hash Set | O(n) | O(n) | When built-in utilities are unavailable or when explaining deduplication logic |
2882. Drop Duplicate Rows | LeetCode | Python | Pandas