Table: Person
+-------------+---------+ | Column Name | Type | +-------------+---------+ | id | int | | email | varchar | +-------------+---------+ id is the primary key (column with unique values) for this table. Each row of this table contains an email. The emails will not contain uppercase letters.
Write a solution to delete all duplicate emails, keeping only one unique email with the smallest id.
For SQL users, please note that you are supposed to write a DELETE statement and not a SELECT one.
For Pandas users, please note that you are supposed to modify Person in place.
After running your script, the answer shown is the Person table. The driver will first compile and run your piece of code and then show the Person table. The final order of the Person table does not matter.
The result format is in the following example.
Example 1:
Input: Person table: +----+------------------+ | id | email | +----+------------------+ | 1 | john@example.com | | 2 | bob@example.com | | 3 | john@example.com | +----+------------------+ Output: +----+------------------+ | id | email | +----+------------------+ | 1 | john@example.com | | 2 | bob@example.com | +----+------------------+ Explanation: john@example.com is repeated two times. We keep the row with the smallest Id = 1.
The goal of #196 Delete Duplicate Emails is to remove duplicate email records from a table while keeping only one instance of each email. Typically, each row contains an id and an email, and the smallest id should be preserved for each unique email.
A common strategy is to use SQL operations that identify duplicates based on the email column and then delete rows whose id is not the minimum within that group. This can be achieved using techniques such as self-joins, subqueries, or GROUP BY with aggregation. The key idea is to compare rows with the same email and retain the one with the smallest identifier.
Efficient indexing on the email column can significantly improve performance when scanning and grouping records. Most database engines can execute this logic in linear or near-linear time depending on indexing and query planning.
| Approach | Time Complexity | Space Complexity |
|---|---|---|
| Self Join to Identify Duplicate Emails | O(n) | O(1) |
| Subquery with MIN(id) per Email | O(n log n) | O(1) |
| GROUP BY with Aggregation | O(n log n) | O(1) |
Frederik Müller
This approach involves using the SQL EXISTS clause to handle the duplicates. By defining a subquery that checks for existence of another row with the same email but a smaller id, we can delete the current row if such a row exists.
Time Complexity: O(n^2) due to the nested subquery, where n is the number of rows.
Space Complexity: O(1) as we are performing operations in place.
1DELETE p1 FROM Person p1 WHERE EXISTS (SELECT 1 FROM Person p2 WHERE p1.email = p2.email AND p1.id > p2.id);This SQL query deletes rows from the Person table where there exists another row within the same table that has the same email but a smaller id. Thus, it ensures that only one unique email with the smallest id is retained.
This approach uses a self-join to create joins within the same table. We join the table on the email field itself and use the condition to keep the entry with the smallest id.
Time Complexity: O(n) primarily, because the operation groups by email and selects min in linear time, although the exact operations may vary by SQL engine.
Space Complexity: O(n) due to the creation of the temporary table holding min ids.
1DELETE FROM Person WHERE id NOT IN (SELECT min_id FROM (SELECT MIN(id) This approach involves using a hash table (or dictionary) to store the smallest ID for each unique email. As you iterate over each email entry, check if the email already exists in the hash table. If it does, compare IDs and update the hash table with the smaller ID if necessary. After processing all entries, use the hash table data to filter out duplicates by retaining entries with only the smallest IDs.
Time Complexity: O(n log n) due to the sort operation.
Space Complexity: O(1) as sorting is done in place.
1
This approach is specifically designed for SQL contexts. The strategy here is to delete entries from a database such that for each email, only the entry with the smallest ID is retained. This involves using SQL commands to isolate entries that have duplicate emails (but higher IDs than the minimum for each specific email) and then deleting them.
Time Complexity: Generally O(n log n) for the grouping operation.
Space Complexity: O(n) with respect to holding email and associated data in memory during sub-query execution.
1DELETE FROM Person
2WHERE id NOT IN (
3 SELECTWatch expert explanations and walkthroughs
Practice problems asked by these companies to ace your technical interviews.
Explore More ProblemsJot down your thoughts, approach, and key learnings
Yes, database cleanup and deduplication problems are common in technical interviews. Questions like this test your understanding of SQL joins, grouping, and safe deletion strategies.
A common optimal approach is to identify duplicates by grouping rows with the same email and keeping the record with the smallest id. SQL techniques like self-joins or subqueries help delete rows whose id is larger than the minimum for that email.
Understanding grouping and row comparison is key. Concepts such as GROUP BY, aggregation functions like MIN(), and self-joins allow you to detect duplicate records and safely remove the extra entries.
Indexes on the email column help databases quickly locate and group duplicate values. Efficient indexing reduces scanning overhead and improves performance when comparing rows with the same email.
The strategy involves creating a subquery that selects the minimum id for each email group. We then delete any row from Person where its id is not in this set of minimum ids.
Using an object in JavaScript to store the smallest ID associated with each email, this solution processes the list of persons and constructs a new list from the recorded unique emails and their smallest IDs within the object.
This SQL solution uses a DELETE statement with a sub-query to identify entries that are not the minimum ID for a given email. The sub-query groups by email and selects the minimum ID for each group, effectively isolating the desired entries to be retained.