Watch 6 video solutions for Drop Duplicate Rows, an easy-level problem. This walkthrough by You Data And AI has 662 views. Want to try solving it yourself? Practice on FleetCode or read the detailed text solution.
DataFrame customers
+-------------+--------+
| Column Name | Type   |
+-------------+--------+
| customer_id | int    |
| name        | object |
| email       | object |
+-------------+--------+
There are some duplicate rows in the DataFrame based on the email column.
Write a solution to remove these duplicate rows and keep only the first occurrence.
The result format is in the following example.
Example 1:

Input:
+-------------+---------+---------------------+
| customer_id | name    | email               |
+-------------+---------+---------------------+
| 1           | Ella    | emily@example.com   |
| 2           | David   | michael@example.com |
| 3           | Zachary | sarah@example.com   |
| 4           | Alice   | john@example.com    |
| 5           | Finn    | john@example.com    |
| 6           | Violet  | alice@example.com   |
+-------------+---------+---------------------+

Output:
+-------------+---------+---------------------+
| customer_id | name    | email               |
+-------------+---------+---------------------+
| 1           | Ella    | emily@example.com   |
| 2           | David   | michael@example.com |
| 3           | Zachary | sarah@example.com   |
| 4           | Alice   | john@example.com    |
| 6           | Violet  | alice@example.com   |
+-------------+---------+---------------------+

Explanation: Alice (customer_id = 4) and Finn (customer_id = 5) both use john@example.com, so only the first occurrence of this email is retained.
Problem Overview: You are given a table-like dataset in which duplicates are defined by the email column, and you need to remove the duplicate rows while keeping the first occurrence of each email. The result should contain only rows with unique emails, preserving their original order of appearance.
Approach 1: Using Built-in Functions (O(n) time, O(n) space)
The most direct solution uses the built-in drop_duplicates() function available in pandas. This method scans the DataFrame and internally tracks previously seen rows using hashing. When a duplicate row appears, it is skipped while the first instance is preserved. Under the hood, pandas performs efficient hash lookups to determine whether a row has already been encountered, giving an overall O(n) time complexity for n rows and O(n) auxiliary space to store hash keys.
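A minimal sketch of this approach (the function name and signature here are illustrative, not a required interface): `subset='email'` restricts duplicate detection to the email column, and `keep='first'` retains the earliest occurrence, which is also the default.

```python
import pandas as pd

def drop_duplicate_emails(customers: pd.DataFrame) -> pd.DataFrame:
    # Rows sharing an email are duplicates; keep only the first occurrence.
    # keep='first' is the default, but spelling it out documents the intent.
    return customers.drop_duplicates(subset='email', keep='first')
```

On the example input, this drops the row with customer_id = 5 (the second john@example.com) and returns the remaining five rows in their original order.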
This approach is concise and optimized because the library handles hashing, row comparison, and indexing internally. In production data pipelines and analytics workflows it is the preferred solution, since it is reliable and backed by pandas' highly optimized internals. If you're working with tabular data manipulation or DataFrame operations, built-in deduplication functions are usually the best option.
Approach 2: Custom Duplicate Check (O(n) time, O(n) space)
A manual approach iterates through the rows and stores each row's deduplication key in a hash set. For this problem the key is simply the email value; in the general full-row case, a row can be converted to a tuple so it becomes hashable. During iteration, check whether the key already exists in the set. If not, add it to the set and keep the row in the result; otherwise skip the row because it is a duplicate.
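A sketch of that loop, again with an illustrative function name. Since duplicates here are defined by email alone, the email value itself serves as the hashable key; for full-row deduplication you would use a tuple of all column values instead.

```python
import pandas as pd

def drop_duplicate_emails_manual(customers: pd.DataFrame) -> pd.DataFrame:
    seen = set()   # emails encountered so far
    keep = []      # positional indices of rows to retain
    for pos, email in enumerate(customers['email']):
        if email not in seen:  # first occurrence: remember it, keep the row
            seen.add(email)
            keep.append(pos)
    return customers.iloc[keep]  # .iloc preserves the original row order
```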
This solution uses a hash table to guarantee constant-time membership checks on average. Each row is processed exactly once, resulting in O(n) time complexity and O(n) space complexity for storing previously seen rows. While this approach is more verbose than the built-in function, it demonstrates the core deduplication logic and works even when built-in helpers are unavailable.
Recommended for interviews: The built-in drop_duplicates() approach is typically expected when working in a pandas environment because it reflects real-world data processing practices. However, interviewers sometimes ask for the underlying logic. Implementing the custom hash-set method shows you understand how duplicate detection works internally. Starting with the manual approach and then mentioning the optimized built-in API demonstrates both algorithmic thinking and practical engineering experience.
| Approach | Time | Space | When to Use |
|---|---|---|---|
| Using Built-in Functions (pandas drop_duplicates) | O(n) | O(n) | Best for real-world pandas workflows and concise solutions |
| Custom Duplicate Check with Hash Set | O(n) | O(n) | When built-in utilities are unavailable or when explaining deduplication logic |