DataFrame customers

+-------------+--------+
| Column Name | Type   |
+-------------+--------+
| customer_id | int    |
| name        | object |
| email       | object |
+-------------+--------+
There are some duplicate rows in the DataFrame based on the email column.
Write a solution to remove these duplicate rows and keep only the first occurrence.
The result format is in the following example.
Example 1:

Input:
+-------------+---------+---------------------+
| customer_id | name    | email               |
+-------------+---------+---------------------+
| 1           | Ella    | emily@example.com   |
| 2           | David   | michael@example.com |
| 3           | Zachary | sarah@example.com   |
| 4           | Alice   | john@example.com    |
| 5           | Finn    | john@example.com    |
| 6           | Violet  | alice@example.com   |
+-------------+---------+---------------------+

Output:
+-------------+---------+---------------------+
| customer_id | name    | email               |
+-------------+---------+---------------------+
| 1           | Ella    | emily@example.com   |
| 2           | David   | michael@example.com |
| 3           | Zachary | sarah@example.com   |
| 4           | Alice   | john@example.com    |
| 6           | Violet  | alice@example.com   |
+-------------+---------+---------------------+

Explanation: Alice (customer_id = 4) and Finn (customer_id = 5) both use john@example.com, so only the first occurrence of this email is retained.
Problem Overview: You are given a table-like dataset and need to remove rows whose email duplicates that of an earlier row, keeping only the first occurrence of each email. The result should preserve the original order of appearance. The same techniques generalize to deduplicating on entire rows or any subset of columns.
Approach 1: Using Built-in Functions (O(n) time, O(n) space)
The most direct solution uses the built-in drop_duplicates() function available in pandas. This method scans the DataFrame and internally tracks previously seen rows using hashing. When a duplicate row appears, it is skipped while the first instance is preserved. Under the hood, pandas performs efficient hash lookups to determine whether a row has already been encountered, giving an overall O(n) time complexity for n rows and O(n) auxiliary space to store hash keys.
This approach is concise and optimized because the library handles hashing, row comparison, and indexing internally. In production data pipelines or analytics workflows, this is the preferred solution since it is reliable and highly optimized in C-backed pandas operations. If you're working with tabular data manipulation or DataFrame operations, built-in deduplication functions are usually the best option.
Approach 2: Custom Duplicate Check (O(n) time, O(n) space)
A manual approach iterates through each row and stores a unique representation of the row inside a hash set. Each row can be converted to a tuple so it becomes hashable. During iteration, check whether the tuple already exists in the set. If not, add it to the set and keep the row in the result; otherwise skip it because it is a duplicate.
This solution uses a hash table to guarantee constant-time membership checks on average. Each row is processed exactly once, resulting in O(n) time complexity and O(n) space complexity for storing previously seen rows. While this approach is more verbose than the built-in function, it demonstrates the core deduplication logic and works even when built-in helpers are unavailable.
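The row-tuple idea described above can be sketched as follows. This is an illustrative helper (the function name `drop_duplicate_rows` is ours), deduplicating on entire rows rather than a single column:

```python
import pandas as pd

def drop_duplicate_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Keep the first occurrence of each fully-duplicated row."""
    seen = set()
    keep = []  # index labels of rows to retain
    for idx, row in df.iterrows():
        key = tuple(row)  # hashable snapshot of the row's values
        if key not in seen:
            seen.add(key)
            keep.append(idx)
    return df.loc[keep]
```

Each row is visited once and set membership is O(1) on average, matching the stated O(n) time and O(n) space bounds.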
Recommended for interviews: The built-in drop_duplicates() approach is typically expected when working in a pandas environment because it reflects real-world data processing practices. However, interviewers sometimes ask for the underlying logic. Implementing the custom hash-set method shows you understand how duplicate detection works internally. Starting with the manual approach and then mentioning the optimized built-in API demonstrates both algorithmic thinking and practical engineering experience.
This approach utilizes built-in library functions designed to handle duplicates, providing a clean and efficient solution.
This Python solution uses the `drop_duplicates` method from the Pandas library, which is specifically designed to remove duplicate entries from a DataFrame based on certain criteria. Here, we specify the `email` column as the basis for detecting duplicates and keep the first occurrence among any found duplicates.
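A minimal sketch of this solution (the function name `drop_duplicate_emails` is ours; the LeetCode template may name it differently):

```python
import pandas as pd

def drop_duplicate_emails(customers: pd.DataFrame) -> pd.DataFrame:
    # keep='first' (the default) retains the earliest row for each email;
    # subset=['email'] restricts duplicate detection to that column.
    return customers.drop_duplicates(subset=['email'], keep='first')
```

Applied to the example input, the row with customer_id = 5 is dropped because its email was already seen at customer_id = 4.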
Time Complexity: O(n), where n is the number of entries in the DataFrame. The operation scans the DataFrame once to identify duplicates.
Space Complexity: O(n). `drop_duplicates` hashes the values it has seen to detect repeats, and it returns a new DataFrame rather than modifying the original in place.
In this approach, we manually check for duplicates using a data structure to track seen entries, allowing us to handle duplicates without relying on specific library functions.
This Python solution uses a set to track seen email addresses while iterating through the DataFrame rows. We keep a list of unique rows, appending only those with unseen emails, thereby manually handling the de-duplication.
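A sketch of this manual approach (the function name `drop_duplicate_emails_manual` is ours):

```python
import pandas as pd

def drop_duplicate_emails_manual(customers: pd.DataFrame) -> pd.DataFrame:
    seen_emails = set()   # emails encountered so far
    unique_rows = []      # rows whose email appeared for the first time
    for _, row in customers.iterrows():
        if row['email'] not in seen_emails:
            seen_emails.add(row['email'])
            unique_rows.append(row)
    return pd.DataFrame(unique_rows)
```

Note that `iterrows()` is much slower than the vectorized `drop_duplicates` on large DataFrames, which is why the built-in is preferred in practice.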
Time Complexity: O(n), where n is the number of entries in the DataFrame. Each entry is processed exactly once.
Space Complexity: O(n), due to the storage needed for the list of unique rows and the set of seen emails.
| Approach | Time | Space | When to Use |
|---|---|---|---|
| Using Built-in Functions (pandas drop_duplicates) | O(n) | O(n) | Best for real-world pandas workflows and concise solutions |
| Custom Duplicate Check with Hash Set | O(n) | O(n) | When built-in utilities are unavailable or when explaining deduplication logic |
2882. Drop Duplicate Rows | LeetCode | Python | Pandas