Watch 10 video solutions for Delete Duplicate Emails, a easy level problem involving Database. This walkthrough by Frederik Müller has 20,369 views views. Want to try solving it yourself? Practice on FleetCode or read the detailed text solution.
Table: Person
+-------------+---------+ | Column Name | Type | +-------------+---------+ | id | int | | email | varchar | +-------------+---------+ id is the primary key (column with unique values) for this table. Each row of this table contains an email. The emails will not contain uppercase letters.
Write a solution to delete all duplicate emails, keeping only one unique email with the smallest id.
For SQL users, please note that you are supposed to write a DELETE statement and not a SELECT one.
For Pandas users, please note that you are supposed to modify Person in place.
After running your script, the answer shown is the Person table. The driver will first compile and run your piece of code and then show the Person table. The final order of the Person table does not matter.
The result format is in the following example.
Example 1:
Input: Person table: +----+------------------+ | id | email | +----+------------------+ | 1 | john@example.com | | 2 | bob@example.com | | 3 | john@example.com | +----+------------------+ Output: +----+------------------+ | id | email | +----+------------------+ | 1 | john@example.com | | 2 | bob@example.com | +----+------------------+ Explanation: john@example.com is repeated two times. We keep the row with the smallest Id = 1.
Problem Overview: The Delete Duplicate Emails problem gives you a Person table containing id and email. Multiple rows may share the same email. The task is to remove duplicate records while keeping only the row with the smallest id for each email address.
Approach 1: SQL Using EXISTS Clause (O(n^2) time, O(1) space)
This method deletes rows if another row with the same email and a smaller id already exists. The query checks duplicates using an EXISTS subquery. For every row, the database scans for a matching email with a lower identifier. If found, the current row is removed. The approach is concise and works well in relational databases that optimize correlated subqueries. This technique relies heavily on concepts from SQL and database query optimization.
Approach 2: Using SQL with Self-Join (O(n^2) time, O(1) space)
A self-join compares rows in the same table. The query joins Person p1 with Person p2 where the email values match but p1.id > p2.id. That condition identifies duplicates that should be removed because a smaller id already exists. The delete operation targets the larger id rows. Self-joins are widely used for record comparison tasks and are often easier to reason about than nested subqueries.
Approach 3: Using a Hash Table (Dictionary) (O(n) time, O(n) space)
If the data is processed in application code instead of directly in SQL, a hash table can track emails already seen. Iterate through each row, and store the first occurrence of every email in a dictionary keyed by email address. If another row with the same email appears, mark it for deletion. Hash lookups run in constant time, so the entire process completes in linear time. This technique demonstrates the power of hash tables for duplicate detection.
Approach 4: SQL-Based Duplicate Removal with Aggregation (O(n log n) time, O(n) space)
Another SQL pattern uses grouping or window functions to identify the smallest id per email. For example, you compute the minimum id for each email group and delete rows whose id is larger than that minimum. This approach is especially useful when the database engine optimizes grouping operations efficiently. It is common in production SQL cleanup tasks where deduplication happens in batches.
Recommended for interviews: The SQL EXISTS solution is the most common answer because it is compact and expresses the logic directly: delete rows where a smaller id with the same email already exists. Mentioning the self-join variant shows deeper SQL understanding. In application-layer problems, the hash table solution demonstrates algorithmic thinking and achieves O(n) time.
| Approach | Time | Space | When to Use |
|---|---|---|---|
| SQL Using EXISTS Clause | O(n^2) | O(1) | Standard SQL solution for deleting duplicates while keeping smallest id |
| SQL Self-Join | O(n^2) | O(1) | Useful when comparing rows within the same table |
| Hash Table (Dictionary) | O(n) | O(n) | When processing data outside the database in application code |
| SQL Aggregation / Grouping | O(n log n) | O(n) | When using GROUP BY or window functions to identify unique rows |