Table: Person
+-------------+---------+ | Column Name | Type | +-------------+---------+ | id | int | | email | varchar | +-------------+---------+ id is the primary key (column with unique values) for this table. Each row of this table contains an email. The emails will not contain uppercase letters.
Write a solution to delete all duplicate emails, keeping only one unique email with the smallest id.
For SQL users, please note that you are supposed to write a DELETE statement and not a SELECT one.
For Pandas users, please note that you are supposed to modify Person in place.
After running your script, the answer shown is the Person table. The driver will first compile and run your piece of code and then show the Person table. The final order of the Person table does not matter.
The result format is in the following example.
Example 1:
Input: Person table: +----+------------------+ | id | email | +----+------------------+ | 1 | john@example.com | | 2 | bob@example.com | | 3 | john@example.com | +----+------------------+ Output: +----+------------------+ | id | email | +----+------------------+ | 1 | john@example.com | | 2 | bob@example.com | +----+------------------+ Explanation: john@example.com is repeated two times. We keep the row with the smallest Id = 1.
Problem Overview: The Delete Duplicate Emails problem gives you a Person table containing id and email. Multiple rows may share the same email. The task is to remove duplicate records while keeping only the row with the smallest id for each email address.
Approach 1: SQL Using EXISTS Clause (O(n^2) time, O(1) space)
This method deletes rows if another row with the same email and a smaller id already exists. The query checks duplicates using an EXISTS subquery. For every row, the database scans for a matching email with a lower identifier. If found, the current row is removed. The approach is concise and works well in relational databases that optimize correlated subqueries. This technique relies heavily on concepts from SQL and database query optimization.
Approach 2: Using SQL with Self-Join (O(n^2) time, O(1) space)
A self-join compares rows in the same table. The query joins Person p1 with Person p2 where the email values match but p1.id > p2.id. That condition identifies duplicates that should be removed because a smaller id already exists. The delete operation targets the larger id rows. Self-joins are widely used for record comparison tasks and are often easier to reason about than nested subqueries.
Approach 3: Using a Hash Table (Dictionary) (O(n) time, O(n) space)
If the data is processed in application code instead of directly in SQL, a hash table can track emails already seen. Iterate through each row, and store the first occurrence of every email in a dictionary keyed by email address. If another row with the same email appears, mark it for deletion. Hash lookups run in constant time, so the entire process completes in linear time. This technique demonstrates the power of hash tables for duplicate detection.
Approach 4: SQL-Based Duplicate Removal with Aggregation (O(n log n) time, O(n) space)
Another SQL pattern uses grouping or window functions to identify the smallest id per email. For example, you compute the minimum id for each email group and delete rows whose id is larger than that minimum. This approach is especially useful when the database engine optimizes grouping operations efficiently. It is common in production SQL cleanup tasks where deduplication happens in batches.
Recommended for interviews: The SQL EXISTS solution is the most common answer because it is compact and expresses the logic directly: delete rows where a smaller id with the same email already exists. Mentioning the self-join variant shows deeper SQL understanding. In application-layer problems, the hash table solution demonstrates algorithmic thinking and achieves O(n) time.
This approach involves using the SQL EXISTS clause to handle the duplicates. By defining a subquery that checks for existence of another row with the same email but a smaller id, we can delete the current row if such a row exists.
This SQL query deletes rows from the Person table where there exists another row within the same table that has the same email but a smaller id. Thus, it ensures that only one unique email with the smallest id is retained.
SQL
Time Complexity: O(n^2) due to the nested subquery, where n is the number of rows.
Space Complexity: O(1) as we are performing operations in place.
This approach uses a self-join to create joins within the same table. We join the table on the email field itself and use the condition to keep the entry with the smallest id.
The strategy involves creating a subquery that selects the minimum id for each email group. We then delete any row from Person where its id is not in this set of minimum ids.
SQL
Time Complexity: O(n) primarily, because the operation groups by email and selects min in linear time, although the exact operations may vary by SQL engine.
Space Complexity: O(n) due to the creation of the temporary table holding min ids.
This approach involves using a hash table (or dictionary) to store the smallest ID for each unique email. As you iterate over each email entry, check if the email already exists in the hash table. If it does, compare IDs and update the hash table with the smaller ID if necessary. After processing all entries, use the hash table data to filter out duplicates by retaining entries with only the smallest IDs.
This solution maintains a sorted list of person records, which allows for simpler deduplication. It uses qsort to sort the entries by email first and then by ID. After that, a single pass through the sorted list is sufficient to remove duplicates.
Time Complexity: O(n log n) due to the sort operation.
Space Complexity: O(1) as sorting is done in place.
This approach is specifically designed for SQL contexts. The strategy here is to delete entries from a database such that for each email, only the entry with the smallest ID is retained. This involves using SQL commands to isolate entries that have duplicate emails (but higher IDs than the minimum for each specific email) and then deleting them.
This SQL solution uses a DELETE statement with a sub-query to identify entries that are not the minimum ID for a given email. The sub-query groups by email and selects the minimum ID for each group, effectively isolating the desired entries to be retained.
SQL
Time Complexity: Generally O(n log n) for the grouping operation.
Space Complexity: O(n) with respect to holding email and associated data in memory during sub-query execution.
| Approach | Complexity |
|---|---|
| SQL Using EXISTS Clause | Time Complexity: O(n^2) due to the nested subquery, where n is the number of rows. |
| Using SQL with Self-Join | Time Complexity: O(n) primarily, because the operation groups by email and selects min in linear time, although the exact operations may vary by SQL engine. |
| Approach 1: Using a Hash Table (Dictionary) | Time Complexity: O(n log n) due to the sort operation. |
| Approach 2: SQL-Based Duplicate Removal | Time Complexity: Generally O(n log n) for the grouping operation. |
| Default Approach | — |
| Approach | Time | Space | When to Use |
|---|---|---|---|
| SQL Using EXISTS Clause | O(n^2) | O(1) | Standard SQL solution for deleting duplicates while keeping smallest id |
| SQL Self-Join | O(n^2) | O(1) | Useful when comparing rows within the same table |
| Hash Table (Dictionary) | O(n) | O(n) | When processing data outside the database in application code |
| SQL Aggregation / Grouping | O(n log n) | O(n) | When using GROUP BY or window functions to identify unique rows |
LeetCode 196: Delete Duplicate Emails [SQL] • Frederik Müller • 22,252 views views
Watch 9 more video solutions →Practice Delete Duplicate Emails with our built-in code editor and test cases.
Practice on FleetCode