Delete Duplicate Emails - Solution & Explanation

EasyDatabase11 min readAsked at: Amazon, Microsoft, Meta +6

Problem Statement

Table: Person

+-------------+---------+
| Column Name | Type    |
+-------------+---------+
| id          | int     |
| email       | varchar |
+-------------+---------+
id is the primary key (column with unique values) for this table.
Each row of this table contains an email. The emails will not contain uppercase letters.

Write a solution to delete all duplicate emails, keeping only one unique email with the smallest id.

For SQL users, please note that you are supposed to write a DELETE statement and not a SELECT one.

For Pandas users, please note that you are supposed to modify Person in place.

After running your script, the answer shown is the Person table. The driver will first compile and run your piece of code and then show the Person table. The final order of the Person table does not matter.

The result format is in the following example.

Example 1:

Input: 
Person table:
+----+------------------+
| id | email            |
+----+------------------+
| 1  | john@example.com |
| 2  | bob@example.com  |
| 3  | john@example.com |
+----+------------------+
Output: 
+----+------------------+
| id | email            |
+----+------------------+
| 1  | john@example.com |
| 2  | bob@example.com  |
+----+------------------+
Explanation: john@example.com is repeated two times. We keep the row with the smallest Id = 1.

Approach Overview

Problem Overview: The Delete Duplicate Emails problem gives you a Person table containing id and email. Multiple rows may share the same email. The task is to remove duplicate records while keeping only the row with the smallest id for each email address.

Approach 1: SQL Using EXISTS Clause (O(n^2) time, O(1) space)

This method deletes rows if another row with the same email and a smaller id already exists. The query checks duplicates using an EXISTS subquery. For every row, the database scans for a matching email with a lower identifier. If found, the current row is removed. The approach is concise and works well in relational databases that optimize correlated subqueries. This technique relies heavily on concepts from SQL and database query optimization.

Approach 2: Using SQL with Self-Join (O(n^2) time, O(1) space)

A self-join compares rows in the same table. The query joins Person p1 with Person p2 where the email values match but p1.id > p2.id. That condition identifies duplicates that should be removed because a smaller id already exists. The delete operation targets the larger id rows. Self-joins are widely used for record comparison tasks and are often easier to reason about than nested subqueries.

Approach 3: Using a Hash Table (Dictionary) (O(n) time, O(n) space)

If the data is processed in application code instead of directly in SQL, a hash table can track emails already seen. Iterate through each row, and store the first occurrence of every email in a dictionary keyed by email address. If another row with the same email appears, mark it for deletion. Hash lookups run in constant time, so the entire process completes in linear time. This technique demonstrates the power of hash tables for duplicate detection.

Approach 4: SQL-Based Duplicate Removal with Aggregation (O(n log n) time, O(n) space)

Another SQL pattern uses grouping or window functions to identify the smallest id per email. For example, you compute the minimum id for each email group and delete rows whose id is larger than that minimum. This approach is especially useful when the database engine optimizes grouping operations efficiently. It is common in production SQL cleanup tasks where deduplication happens in batches.

Recommended for interviews: The SQL EXISTS solution is the most common answer because it is compact and expresses the logic directly: delete rows where a smaller id with the same email already exists. Mentioning the self-join variant shows deeper SQL understanding. In application-layer problems, the hash table solution demonstrates algorithmic thinking and achieves O(n) time.

Approach 1: SQL Using EXISTS Clause

This approach involves using the SQL EXISTS clause to handle the duplicates. By defining a subquery that checks for existence of another row with the same email but a smaller id, we can delete the current row if such a row exists.

This SQL query deletes rows from the Person table where there exists another row within the same table that has the same email but a smaller id. Thus, it ensures that only one unique email with the smallest id is retained.

Code

SQL

Complexity

Time Complexity: O(n^2) due to the nested subquery, where n is the number of rows.
Space Complexity: O(1) as we are performing operations in place.

Try this approach in the editor →

Approach 2: Using SQL with Self-Join

This approach uses a self-join to create joins within the same table. We join the table on the email field itself and use the condition to keep the entry with the smallest id.

The strategy involves creating a subquery that selects the minimum id for each email group. We then delete any row from Person where its id is not in this set of minimum ids.

Code

SQL

Complexity

Time Complexity: O(n) primarily, because the operation groups by email and selects min in linear time, although the exact operations may vary by SQL engine.
Space Complexity: O(n) due to the creation of the temporary table holding min ids.

Try this approach in the editor →

Approach 3: Approach 1: Using a Hash Table (Dictionary)

This approach involves using a hash table (or dictionary) to store the smallest ID for each unique email. As you iterate over each email entry, check if the email already exists in the hash table. If it does, compare IDs and update the hash table with the smaller ID if necessary. After processing all entries, use the hash table data to filter out duplicates by retaining entries with only the smallest IDs.

This solution maintains a sorted list of person records, which allows for simpler deduplication. It uses qsort to sort the entries by email first and then by ID. After that, a single pass through the sorted list is sufficient to remove duplicates.

Code

C C++Java Python C#JavaScript

C++

Java

Python

JavaScript

Complexity

Time Complexity: O(n log n) due to the sort operation.
Space Complexity: O(1) as sorting is done in place.

Try this approach in the editor →

Approach 4: Approach 2: SQL-Based Duplicate Removal

This approach is specifically designed for SQL contexts. The strategy here is to delete entries from a database such that for each email, only the entry with the smallest ID is retained. This involves using SQL commands to isolate entries that have duplicate emails (but higher IDs than the minimum for each specific email) and then deleting them.

This SQL solution uses a DELETE statement with a sub-query to identify entries that are not the minimum ID for a given email. The sub-query groups by email and selects the minimum ID for each group, effectively isolating the desired entries to be retained.

Code

SQL

Complexity

Time Complexity: Generally O(n log n) for the grouping operation.
Space Complexity: O(n) with respect to holding email and associated data in memory during sub-query execution.

Try this approach in the editor →

Approach 5: Default Approach

Code

Python MySQL

Python

MySQL

Try this approach in the editor →

Complexity Comparison

Approach	Complexity
SQL Using EXISTS Clause	Time Complexity: O(n^2) due to the nested subquery, where n is the number of rows. Space Complexity: O(1) as we are performing operations in place.
Using SQL with Self-Join	Time Complexity: O(n) primarily, because the operation groups by email and selects min in linear time, although the exact operations may vary by SQL engine. Space Complexity: O(n) due to the creation of the temporary table holding min ids.
Approach 1: Using a Hash Table (Dictionary)	Time Complexity: O(n log n) due to the sort operation. Space Complexity: O(1) as sorting is done in place.
Approach 2: SQL-Based Duplicate Removal	Time Complexity: Generally O(n log n) for the grouping operation. Space Complexity: O(n) with respect to holding email and associated data in memory during sub-query execution.
Default Approach	—

Detailed Complexity Analysis

Approach	Time	Space	When to Use
SQL Using EXISTS Clause	O(n^2)	O(1)	Standard SQL solution for deleting duplicates while keeping smallest id
SQL Self-Join	O(n^2)	O(1)	Useful when comparing rows within the same table
Hash Table (Dictionary)	O(n)	O(n)	When processing data outside the database in application code
SQL Aggregation / Grouping	O(n log n)	O(n)	When using GROUP BY or window functions to identify unique rows

Video Solution

LeetCode 196: Delete Duplicate Emails [SQL] • Frederik Müller • 22,252 views views

Watch 9 more video solutions →

Frequently Asked Questions

Is Delete Duplicate Emails easy or hard?

Delete Duplicate Emails is classified as an Easy problem. The challenge mainly tests SQL fundamentals such as self-joins, correlated subqueries, and understanding how to keep one record while deleting duplicates.

Delete Duplicate Emails Python/Java solution

In Python or Java, store emails in a hash map or set while iterating through the records. The first occurrence of an email is kept, and later duplicates are removed or skipped. This method runs in O(n) time and uses O(n) extra memory.

How to solve Delete Duplicate Emails in O(n)?

Use a hash table keyed by email address. Iterate through each record once and store the first occurrence of every email. When a duplicate email appears, mark that row for deletion. Hash lookups take constant time, so the full scan completes in O(n).

What is the best approach for Delete Duplicate Emails?

The most common solution uses a SQL EXISTS clause. It deletes a row when another row with the same email and a smaller id already exists. This approach directly expresses the rule of keeping the smallest id and is widely accepted in SQL interviews and database tasks.

Is Delete Duplicate Emails asked at Google/Amazon/Meta?

Database cleanup and deduplication problems like this appear frequently in interviews at companies such as Amazon and Google, especially for backend or data-focused roles. Candidates are expected to demonstrate strong SQL skills including joins, subqueries, and grouping.

What data structure is used in Delete Duplicate Emails?

In SQL solutions the database engine handles comparisons using joins or subqueries. When solved programmatically, a hash table (dictionary) is used to track emails already seen and detect duplicates efficiently.

What is the time complexity of Delete Duplicate Emails?

Most SQL approaches such as EXISTS or self-join comparisons run in O(n^2) time in the worst case because each row may be compared with other rows. If the data is processed in application code using a hash table, the complexity improves to O(n) time with O(n) additional space.

Ready to solve this problem?

Practice Delete Duplicate Emails with our built-in code editor and test cases.

Practice on FleetCode

Combine Two Tables

Second Highest Salary

Problem Info

DifficultyEasy

Acceptance63.4%

Approaches5

Reading time11 min

Asked at

Amazon Microsoft Meta Infosys Google

Practice this problem

Open in Editor

Problem Statement

Approach Overview

Approach 1: SQL Using EXISTS Clause

Code

Complexity

Approach 2: Using SQL with Self-Join

Code

Complexity

Approach 3: Approach 1: Using a Hash Table (Dictionary)

Code

Complexity

Approach 4: Approach 2: SQL-Based Duplicate Removal

Code

Complexity

Approach 5: Default Approach

Code

Complexity Comparison

Detailed Complexity Analysis

Video Solution

Frequently Asked Questions

Related Problems

Ready to solve this problem?

Problem Info

Table of Contents

Problem Statement

Approach Overview

Approach 1: SQL Using EXISTS Clause

Code

Complexity

Approach 2: Using SQL with Self-Join

Code

Complexity

Approach 3: Approach 1: Using a Hash Table (Dictionary)

Code

Complexity

Approach 4: Approach 2: SQL-Based Duplicate Removal

Code

Complexity

Approach 5: Default Approach

Code

Complexity Comparison

Detailed Complexity Analysis

Video Solution

Frequently Asked Questions

Related Problems

Ready to solve this problem?

Problem Info

Table of Contents