DataFrame customers:

| Column Name | Type   |
|-------------|--------|
| customer_id | int    |
| name        | object |
| email       | object |
There are some duplicate rows in the DataFrame based on the email column.
Write a solution to remove these duplicate rows and keep only the first occurrence.
The result format is in the following example.
Example 1:

Input:

| customer_id | name    | email               |
|-------------|---------|---------------------|
| 1           | Ella    | emily@example.com   |
| 2           | David   | michael@example.com |
| 3           | Zachary | sarah@example.com   |
| 4           | Alice   | john@example.com    |
| 5           | Finn    | john@example.com    |
| 6           | Violet  | alice@example.com   |

Output:

| customer_id | name    | email               |
|-------------|---------|---------------------|
| 1           | Ella    | emily@example.com   |
| 2           | David   | michael@example.com |
| 3           | Zachary | sarah@example.com   |
| 4           | Alice   | john@example.com    |
| 6           | Violet  | alice@example.com   |

Explanation: Alice (customer_id = 4) and Finn (customer_id = 5) both use john@example.com, so only the first occurrence of this email is retained.
This approach utilizes built-in library functions designed to handle duplicates, providing a clean and efficient solution.
This Python solution uses the `drop_duplicates` method from the Pandas library, which is specifically designed to remove duplicate entries from a DataFrame based on certain criteria. Here, we specify the `email` column as the basis for detecting duplicates and keep the first occurrence among any found duplicates.
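A minimal sketch of this approach; the function name `dropDuplicateEmails` is assumed to match the usual LeetCode pandas problem signature:

```python
import pandas as pd


def dropDuplicateEmails(customers: pd.DataFrame) -> pd.DataFrame:
    # subset=['email'] detects duplicates by the email column only;
    # keep='first' retains the first occurrence of each email.
    return customers.drop_duplicates(subset=['email'], keep='first')
```

Note that `drop_duplicates` preserves the original row order, so the first occurrence in the input is the one that survives.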
Time Complexity: O(n), where n is the number of entries in the DataFrame. The operation scans the DataFrame once to identify duplicates.
Space Complexity: O(1) auxiliary beyond the result itself; by default `drop_duplicates` returns a new DataFrame of up to n rows, so O(n) space is used for the output unless `inplace=True` is passed.
In this approach, we manually check for duplicates using a data structure to track seen entries, allowing us to handle duplicates without relying on specific library functions.
This Python solution uses a set to track seen email addresses while iterating through the DataFrame rows. We keep a list of unique rows, appending only those with unseen emails, thereby manually handling the de-duplication.
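A sketch of the manual de-duplication described above; again, the function name `dropDuplicateEmails` is an assumption to match the problem format:

```python
import pandas as pd


def dropDuplicateEmails(customers: pd.DataFrame) -> pd.DataFrame:
    seen = set()       # email addresses encountered so far
    unique_rows = []   # rows whose email had not been seen before
    for _, row in customers.iterrows():
        if row['email'] not in seen:
            seen.add(row['email'])
            unique_rows.append(row)
    # Rebuild a DataFrame from the retained rows, keeping the column order
    return pd.DataFrame(unique_rows, columns=customers.columns)
```

Iterating row by row with `iterrows` is much slower than the vectorized built-in, but it makes the de-duplication logic explicit.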
Time Complexity: O(n), where n is the number of entries in the DataFrame. Each entry is processed exactly once.
Space Complexity: O(n), due to the storage needed for the list of unique rows and the set of seen emails.
| Approach | Time Complexity | Space Complexity |
|---|---|---|
| Using Built-in Functions | O(n), where n is the number of entries in the DataFrame; the DataFrame is scanned once to identify duplicates. | O(1) auxiliary; O(n) for the new DataFrame returned by default. |
| Custom Duplicate Check | O(n), where n is the number of entries in the DataFrame; each entry is processed exactly once. | O(n), for the list of unique rows and the set of seen emails. |