DataFrame customers

+-------------+--------+
| Column Name | Type   |
+-------------+--------+
| customer_id | int    |
| name        | object |
| email       | object |
+-------------+--------+
There are some duplicate rows in the DataFrame based on the email column.
Write a solution to remove these duplicate rows and keep only the first occurrence.
The result format is in the following example.
Example 1:

Input:
+-------------+---------+---------------------+
| customer_id | name    | email               |
+-------------+---------+---------------------+
| 1           | Ella    | emily@example.com   |
| 2           | David   | michael@example.com |
| 3           | Zachary | sarah@example.com   |
| 4           | Alice   | john@example.com    |
| 5           | Finn    | john@example.com    |
| 6           | Violet  | alice@example.com   |
+-------------+---------+---------------------+

Output:
+-------------+---------+---------------------+
| customer_id | name    | email               |
+-------------+---------+---------------------+
| 1           | Ella    | emily@example.com   |
| 2           | David   | michael@example.com |
| 3           | Zachary | sarah@example.com   |
| 4           | Alice   | john@example.com    |
| 6           | Violet  | alice@example.com   |
+-------------+---------+---------------------+

Explanation: Alice (customer_id = 4) and Finn (customer_id = 5) both use john@example.com, so only the first occurrence of this email is retained.
In #2882 Drop Duplicate Rows, the key idea is to identify rows that appear more than once and keep only the first occurrence. A common strategy is to track previously seen rows using a hash-based data structure such as a set or hash map. As you iterate through the dataset, convert each row into a hashable representation and check whether it has already been encountered.
If the row is new, add it to the tracking structure and include it in the result. If it has already been seen, simply skip it. This ensures that only unique rows remain while preserving their original order. The hashing approach works efficiently because membership checks in a set are typically constant time.
An alternative idea is sorting the rows first and then removing adjacent duplicates, but this may change ordering and add extra overhead. The hash-based method is usually preferred for its simplicity and efficiency.
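The hash-based strategy described above can be sketched in plain Python. This is a minimal illustration, assuming each row is a `(customer_id, name, email)` tuple and duplicates are defined by the email field; the helper name `dedupe_by_email` is illustrative, not part of the problem.

```python
def dedupe_by_email(rows):
    seen = set()    # emails encountered so far
    result = []
    for row in rows:
        email = row[2]
        if email not in seen:   # first occurrence: keep it
            seen.add(email)
            result.append(row)
    return result

rows = [
    (4, 'Alice', 'john@example.com'),
    (5, 'Finn', 'john@example.com'),
    (6, 'Violet', 'alice@example.com'),
]
print(dedupe_by_email(rows))
# keeps (4, 'Alice', ...) and (6, 'Violet', ...); the row for Finn is skipped
```

Because set membership checks are O(1) on average, the whole pass is linear in the number of rows while preserving their original order.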
| Approach | Time Complexity | Space Complexity |
|---|---|---|
| Hash Set / Hash Map Tracking | O(n) | O(n) |
| Sorting + Adjacent Comparison | O(n log n) | O(1) or O(n) |
Use these hints if you're stuck. Try solving on your own first.
Consider using a built-in function from the pandas library to remove duplicate rows based on a specified column.
This approach utilizes built-in library functions designed to handle duplicates, providing a clean and efficient solution.
Time Complexity: O(n), where n is the number of entries in the DataFrame. The operation scans the DataFrame once to identify duplicates.
Space Complexity: O(n), since `drop_duplicates` returns a new DataFrame rather than modifying the original in place (in-place modification requires passing `inplace=True`), and it tracks seen keys internally while scanning.
```python
import pandas as pd

def drop_duplicate_rows(df):
    # Drop duplicates based on the 'email' column, keeping the first occurrence
    return df.drop_duplicates(subset='email', keep='first')

# Example usage
if __name__ == '__main__':
    data = {
        'customer_id': [1, 2, 3, 4, 5, 6],
        'name': ['Ella', 'David', 'Zachary', 'Alice', 'Finn', 'Violet'],
        'email': ['emily@example.com', 'michael@example.com', 'sarah@example.com',
                  'john@example.com', 'john@example.com', 'alice@example.com']
    }
    df = pd.DataFrame(data)
    result = drop_duplicate_rows(df)
    print(result)
```

This Python solution uses the `drop_duplicates` method from the pandas library, which is specifically designed to remove duplicate entries from a DataFrame based on certain criteria. Here, we specify the `email` column as the basis for detecting duplicates and keep the first occurrence among any found duplicates.
In this approach, we manually check for duplicates using a data structure to track seen entries, allowing us to handle duplicates without relying on specific library functions.
Time Complexity: O(n), where n is the number of entries in the DataFrame. Each entry is processed exactly once.
Space Complexity: O(n), due to the storage needed for the list of unique rows and the set of seen emails.
```python
import pandas as pd

def drop_duplicate_rows_custom(df):
    seen = set()        # emails encountered so far
    unique_rows = []    # rows to keep, in original order
    for _, row in df.iterrows():
        if row['email'] not in seen:
            seen.add(row['email'])
            unique_rows.append(row)
    return pd.DataFrame(unique_rows)
```
Yes, variations of duplicate removal problems appear frequently in coding interviews. They test understanding of hashing, data deduplication, and efficient iteration over collections.
A hash-based structure such as a set or hash map works best for this problem. It allows fast lookups to determine whether a row has already been encountered, making the overall process efficient.
The optimal approach typically uses a hash set to track rows that have already appeared. As you iterate through the rows, you add unseen ones to the set and skip duplicates. This allows duplicate detection in constant average time.
Yes, one alternative is sorting the rows and then removing adjacent duplicates. However, this approach increases time complexity and may alter the original order of rows, which is often undesirable.
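The sort-based alternative can be sketched as follows. Note the semantic difference: after sorting, "first occurrence" means first in sorted order, not first in the original row order, so the result may differ from the hash-based approach unless the original order is restored afterwards. The helper name is illustrative.

```python
def dedupe_by_sorting(rows):
    # Sort by email so duplicates become adjacent, then keep a row only
    # when its email differs from the previous row's email.
    rows_sorted = sorted(rows, key=lambda r: r[2])
    result = []
    prev_email = None
    for row in rows_sorted:
        if row[2] != prev_email:
            result.append(row)
            prev_email = row[2]
    return result

rows = [
    (4, 'Alice', 'john@example.com'),
    (5, 'Finn', 'john@example.com'),
    (6, 'Violet', 'alice@example.com'),
]
print(dedupe_by_sorting(rows))
# rows come back ordered by email, with Finn's duplicate removed
```

The sort costs O(n log n) versus O(n) for the hash-based pass, which is why the latter is usually preferred.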
This Python solution uses a set to track seen email addresses while iterating through the DataFrame rows. We keep a list of unique rows, appending only those with unseen emails, thereby manually handling the de-duplication.