Table: Person
+-------------+---------+
| Column Name | Type    |
+-------------+---------+
| id          | int     |
| email       | varchar |
+-------------+---------+
id is the primary key (column with unique values) for this table.
Each row of this table contains an email. The emails will not contain uppercase letters.
Write a solution to delete all duplicate emails, keeping only one unique email with the smallest id.
For SQL users, please note that you are supposed to write a DELETE statement and not a SELECT one.
For Pandas users, please note that you are supposed to modify Person in place.
After running your script, the answer shown is the Person table. The driver will first compile and run your piece of code and then show the Person table. The final order of the Person table does not matter.
The result format is in the following example.
Example 1:
Input:
Person table:
+----+------------------+
| id | email            |
+----+------------------+
| 1  | john@example.com |
| 2  | bob@example.com  |
| 3  | john@example.com |
+----+------------------+
Output:
+----+------------------+
| id | email            |
+----+------------------+
| 1  | john@example.com |
| 2  | bob@example.com  |
+----+------------------+
Explanation: john@example.com is repeated two times. We keep the row with the smallest id = 1.
Problem Overview: You have a Person table where multiple rows may share the same email. The task is to delete duplicate rows while keeping only the record with the smallest Id for each email. The challenge is identifying duplicates and removing them without affecting the earliest entry.
Approach 1: SQL Using EXISTS Clause (Time: O(n²), Space: O(1))
This approach checks whether another row exists with the same email but a smaller Id. If such a row exists, the current row is a duplicate and should be deleted. The query uses an EXISTS subquery to compare rows within the same table. Conceptually, the database scans rows and performs a conditional lookup to see if a smaller Id already represents that email. This approach works well in most relational databases and keeps the logic compact using pure SQL.
Approach 2: Using SQL with Self-Join (Time: O(n²), Space: O(1))
The self-join method joins the Person table with itself on matching email values. During the join, rows with larger Id values are identified as duplicates when another row with the same email but smaller Id exists. Those larger Id rows are then deleted. This technique explicitly shows the pairwise comparison between rows and is easy to reason about when debugging duplicate detection logic in database queries.
Approach 3: Using a Hash Table (Dictionary) (Time: O(n), Space: O(n))
If the dataset is processed in application code instead of pure SQL, a hash table provides an efficient solution. Iterate through all rows and store emails in a dictionary keyed by email. When a duplicate email appears, compare the stored Id and keep the smaller one. Any later occurrence with a larger Id is marked for deletion. Hash lookups run in constant time, giving an overall linear runtime.
Approach 4: SQL-Based Duplicate Removal with Aggregation (Time: O(n log n), Space: O(n))
This method groups rows by email using GROUP BY and computes the minimum Id per email. Rows whose Id is not the minimum are deleted. Internally, the database performs grouping or sorting operations to determine which row should be preserved. This approach is readable and works well when aggregation logic already exists in your query pipeline.
Recommended for interviews: The EXISTS or self-join SQL solutions are typically expected in database interviews because they demonstrate strong relational query thinking. The hash table approach is useful when the problem is framed in application code. Showing both the brute-force relational comparison and the optimized lookup strategy signals solid understanding of duplicate detection problems.
This approach involves using the SQL EXISTS clause to handle the duplicates. By defining a subquery that checks for existence of another row with the same email but a smaller id, we can delete the current row if such a row exists.
This SQL query deletes rows from the Person table where there exists another row within the same table that has the same email but a smaller id. Thus, it ensures that only one unique email with the smallest id is retained.
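As a concrete sketch of the EXISTS query, the snippet below uses Python's built-in sqlite3 module as a stand-in database, seeded with the example data. Note that MySQL rejects a subquery that reads from the table being deleted from (error 1093), so there you would use the self-join form instead; SQLite and PostgreSQL accept the query as written.

```python
import sqlite3

# In-memory database seeded with the example data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Person (id INT PRIMARY KEY, email VARCHAR(255))")
conn.executemany(
    "INSERT INTO Person VALUES (?, ?)",
    [(1, "john@example.com"), (2, "bob@example.com"), (3, "john@example.com")],
)

# Delete any row for which another row with the same email
# but a smaller id exists.
conn.execute("""
    DELETE FROM Person
    WHERE EXISTS (
        SELECT 1 FROM Person p
        WHERE p.email = Person.email AND p.id < Person.id
    )
""")

print(conn.execute("SELECT id, email FROM Person ORDER BY id").fetchall())
# [(1, 'john@example.com'), (2, 'bob@example.com')]
```

The row with the smallest id per email never satisfies the EXISTS condition, so it is always retained.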
Time Complexity: O(n^2) due to the nested subquery, where n is the number of rows.
Space Complexity: O(1) as we are performing operations in place.
This approach joins the Person table with itself on the email column. When two rows share the same email, the row with the larger id is a duplicate: the join condition p1.id > p2.id pairs every duplicate row with the smaller-id row that should survive, and exactly those p1 rows are deleted.
In MySQL this is written as a multi-table DELETE: join Person (aliased p1) to itself (aliased p2) on p1.email = p2.email AND p1.id > p2.id, then delete the p1 rows produced by the join.
Time Complexity: O(n^2) in the worst case, since each row may be compared against every other row sharing its email; the exact cost depends on the SQL engine and available indexes.
Space Complexity: O(1), as rows are deleted in place without auxiliary structures.
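MySQL supports the multi-table form `DELETE p1 FROM Person p1 JOIN Person p2 ON p1.email = p2.email AND p1.id > p2.id;` directly. SQLite does not allow DELETE with a JOIN clause, so the sketch below (again hosted in Python's sqlite3) performs the same self-join inside an id subquery:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Person (id INT PRIMARY KEY, email VARCHAR(255))")
conn.executemany(
    "INSERT INTO Person VALUES (?, ?)",
    [(1, "john@example.com"), (2, "bob@example.com"), (3, "john@example.com")],
)

# Join Person to itself on email; any p1 row paired with a smaller-id
# p2 row is a duplicate, so its id lands in the deletion set.
conn.execute("""
    DELETE FROM Person
    WHERE id IN (
        SELECT p1.id
        FROM Person p1
        JOIN Person p2 ON p1.email = p2.email AND p1.id > p2.id
    )
""")

print(conn.execute("SELECT id, email FROM Person ORDER BY id").fetchall())
# [(1, 'john@example.com'), (2, 'bob@example.com')]
```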
This approach involves using a hash table (or dictionary) to store the smallest ID for each unique email. As you iterate over each email entry, check if the email already exists in the hash table. If it does, compare IDs and update the hash table with the smaller ID if necessary. After processing all entries, use the hash table data to filter out duplicates by retaining entries with only the smallest IDs.
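A minimal in-application sketch of that bookkeeping, in plain Python with assumed (id, email) tuples as the row representation:

```python
def dedupe_by_email(rows):
    """Keep, for each email, only the row with the smallest id.

    rows is a list of (id, email) tuples; surviving rows are
    returned in their original order.
    """
    smallest = {}  # email -> smallest id seen so far
    for row_id, email in rows:
        if email not in smallest or row_id < smallest[email]:
            smallest[email] = row_id
    keep = set(smallest.values())
    return [r for r in rows if r[0] in keep]

rows = [(1, "john@example.com"), (2, "bob@example.com"), (3, "john@example.com")]
print(dedupe_by_email(rows))
# [(1, 'john@example.com'), (2, 'bob@example.com')]
```

Each email triggers one O(1) dictionary lookup, giving the O(n) runtime at the cost of O(n) extra space for the dictionary.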
An alternative implementation keeps the records in a sorted list, which simplifies deduplication: sort the entries by email first and then by id (e.g., with qsort in C), after which a single pass through the sorted list removes every duplicate while keeping the first, smallest-id entry of each email run.
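The sort-then-scan variant looks like this in Python, with sorted taking the place of C's qsort:

```python
def dedupe_sorted(rows):
    """Sort (id, email) tuples by email then id; one pass keeps the
    first (smallest-id) entry of each run of equal emails."""
    rows = sorted(rows, key=lambda r: (r[1], r[0]))
    result = []
    for row in rows:
        # A new email starts whenever it differs from the last kept row.
        if not result or result[-1][1] != row[1]:
            result.append(row)
    return result

rows = [(1, "john@example.com"), (2, "bob@example.com"), (3, "john@example.com")]
print(dedupe_sorted(rows))
# [(2, 'bob@example.com'), (1, 'john@example.com')]
```

Unlike the hash table version, the output is ordered by email rather than by original position; the problem states that the final order does not matter.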
Time Complexity: O(n log n) due to the sort operation.
Space Complexity: O(1) as sorting is done in place.
This approach is specifically designed for SQL contexts. The strategy here is to delete entries from a database such that for each email, only the entry with the smallest ID is retained. This involves using SQL commands to isolate entries that have duplicate emails (but higher IDs than the minimum for each specific email) and then deleting them.
This SQL solution uses a DELETE statement with a sub-query to identify entries that are not the minimum ID for a given email. The sub-query groups by email and selects the minimum ID for each group, effectively isolating the desired entries to be retained.
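A sketch of the aggregation form, hosted in sqlite3 once more. One hedge: MySQL rejects a subquery that selects from the table being deleted from, so there the GROUP BY subquery must be wrapped in a derived table; SQLite accepts it directly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Person (id INT PRIMARY KEY, email VARCHAR(255))")
conn.executemany(
    "INSERT INTO Person VALUES (?, ?)",
    [(1, "john@example.com"), (2, "bob@example.com"), (3, "john@example.com")],
)

# Compute the minimum id per email group; delete every other row.
conn.execute("""
    DELETE FROM Person
    WHERE id NOT IN (
        SELECT MIN(id) FROM Person GROUP BY email
    )
""")

print(conn.execute("SELECT id, email FROM Person ORDER BY id").fetchall())
# [(1, 'john@example.com'), (2, 'bob@example.com')]
```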
Time Complexity: Generally O(n log n) for the grouping operation.
Space Complexity: O(n) with respect to holding email and associated data in memory during sub-query execution.
| Approach | Time | Space | When to Use |
|---|---|---|---|
| SQL Using EXISTS Clause | O(n²) | O(1) | Common SQL interview solution for removing duplicates directly in the database |
| SQL Self-Join | O(n²) | O(1) | Useful when explicit row comparison between duplicate entries is needed |
| Hash Table (Dictionary) | O(n) | O(n) | When handling duplicates in application code outside the database |
| SQL Aggregation with GROUP BY | O(n log n) | O(n) | When aggregation queries already exist and minimum id per email must be preserved |