DataFrame customers

+-------------+--------+
| Column Name | Type   |
+-------------+--------+
| customer_id | int    |
| name        | object |
| email       | object |
+-------------+--------+
There are some duplicate rows in the DataFrame based on the email column.
Write a solution to remove these duplicate rows and keep only the first occurrence.
The result format is in the following example.
Example 1:

Input:
+-------------+---------+---------------------+
| customer_id | name    | email               |
+-------------+---------+---------------------+
| 1           | Ella    | emily@example.com   |
| 2           | David   | michael@example.com |
| 3           | Zachary | sarah@example.com   |
| 4           | Alice   | john@example.com    |
| 5           | Finn    | john@example.com    |
| 6           | Violet  | alice@example.com   |
+-------------+---------+---------------------+

Output:
+-------------+---------+---------------------+
| customer_id | name    | email               |
+-------------+---------+---------------------+
| 1           | Ella    | emily@example.com   |
| 2           | David   | michael@example.com |
| 3           | Zachary | sarah@example.com   |
| 4           | Alice   | john@example.com    |
| 6           | Violet  | alice@example.com   |
+-------------+---------+---------------------+

Explanation: Alice (customer_id = 4) and Finn (customer_id = 5) both use john@example.com, so only the first occurrence of this email is retained.
In #2882 Drop Duplicate Rows, the key idea is to identify rows that appear more than once and keep only the first occurrence. A common strategy is to track previously seen rows using a hash-based data structure such as a set or hash map. As you iterate through the dataset, convert each row into a hashable representation and check whether it has already been encountered.
If the row is new, add it to the tracking structure and include it in the result. If it has already been seen, simply skip it. This ensures that only unique rows remain while preserving their original order. The hashing approach works efficiently because membership checks in a set are typically constant time.
An alternative idea is sorting the rows first and then removing adjacent duplicates, but this may change ordering and add extra overhead. The hash-based method is usually preferred for its simplicity and efficiency.
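The hash-based strategy described above can be sketched in plain Python. This is a minimal illustration, assuming each row is a `(customer_id, name, email)` tuple and duplicates are defined by the email field; the helper name `dedupe_by_email` is illustrative, not part of the problem.

```python
def dedupe_by_email(rows):
    seen = set()    # emails encountered so far
    result = []
    for row in rows:
        email = row[2]
        if email not in seen:   # first occurrence: keep it
            seen.add(email)
            result.append(row)
    return result

rows = [
    (4, 'Alice', 'john@example.com'),
    (5, 'Finn', 'john@example.com'),
    (6, 'Violet', 'alice@example.com'),
]
print(dedupe_by_email(rows))
# keeps (4, 'Alice', ...) and (6, 'Violet', ...); the row for Finn is skipped
```

Because set membership checks are O(1) on average, the whole pass is linear in the number of rows while preserving their original order.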
| Approach | Time Complexity | Space Complexity |
|---|---|---|
| Hash Set / Hash Map Tracking | O(n) | O(n) |
| Sorting + Adjacent Comparison | O(n log n) | O(1) or O(n) |
Use these hints if you're stuck. Try solving on your own first.
Consider using a built-in function from the pandas library to remove duplicate rows based on a specified column.
This approach utilizes built-in library functions designed to handle duplicates, providing a clean and efficient solution.
Time Complexity: O(n), where n is the number of entries in the DataFrame. The operation scans the DataFrame once to identify duplicates.
Space Complexity: O(n), since `drop_duplicates` returns a new DataFrame rather than modifying the original in place (in-place modification requires passing `inplace=True`), and it tracks seen keys internally while scanning.
```python
import pandas as pd

def drop_duplicate_rows(df):
    # Drop duplicates based on the 'email' column, keeping the first occurrence
    return df.drop_duplicates(subset='email', keep='first')

# Example usage
if __name__ == '__main__':
    data = {
        'customer_id': [1, 2, 3, 4, 5, 6],
        'name': ['Ella', 'David', 'Zachary', 'Alice', 'Finn', 'Violet'],
        'email': ['emily@example.com', 'michael@example.com', 'sarah@example.com',
                  'john@example.com', 'john@example.com', 'alice@example.com']
    }
    df = pd.DataFrame(data)
    result = drop_duplicate_rows(df)
    print(result)
```

This Python solution uses the `drop_duplicates` method from the pandas library, which is specifically designed to remove duplicate entries from a DataFrame based on certain criteria. Here, we specify the `email` column as the basis for detecting duplicates and keep the first occurrence among any found duplicates.
In this approach, we manually check for duplicates using a data structure to track seen entries, allowing us to handle duplicates without relying on specific library functions.
Time Complexity: O(n), where n is the number of entries in the DataFrame. Each entry is processed exactly once.
Space Complexity: O(n), due to the storage needed for the list of unique rows and the set of seen emails.
```python
import pandas as pd

def drop_duplicate_rows_custom(df):
    seen = set()        # emails encountered so far
    unique_rows = []    # rows to keep, in original order
    for _, row in df.iterrows():
        if row['email'] not in seen:
            seen.add(row['email'])
            unique_rows.append(row)
    return pd.DataFrame(unique_rows)
```
Yes, variations of duplicate removal problems appear frequently in coding interviews. They test understanding of hashing, data deduplication, and efficient iteration over collections.
A hash-based structure such as a set or hash map works best for this problem. It allows fast lookups to determine whether a row has already been encountered, making the overall process efficient.
The optimal approach typically uses a hash set to track rows that have already appeared. As you iterate through the rows, you add unseen ones to the set and skip duplicates. This allows duplicate detection in constant average time.
Yes, one alternative is sorting the rows and then removing adjacent duplicates. However, this approach increases time complexity and may alter the original order of rows, which is often undesirable.
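The sort-based alternative can be sketched as follows. Note the semantic difference: after sorting, "first occurrence" means first in sorted order, not first in the original row order, so the result may differ from the hash-based approach unless the original order is restored afterwards. The helper name is illustrative.

```python
def dedupe_by_sorting(rows):
    # Sort by email so duplicates become adjacent, then keep a row only
    # when its email differs from the previous row's email.
    rows_sorted = sorted(rows, key=lambda r: r[2])
    result = []
    prev_email = None
    for row in rows_sorted:
        if row[2] != prev_email:
            result.append(row)
            prev_email = row[2]
    return result

rows = [
    (4, 'Alice', 'john@example.com'),
    (5, 'Finn', 'john@example.com'),
    (6, 'Violet', 'alice@example.com'),
]
print(dedupe_by_sorting(rows))
# rows come back ordered by email, with Finn's duplicate removed
```

The sort costs O(n log n) versus O(n) for the hash-based pass, which is why the latter is usually preferred.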
This Python solution uses a set to track seen email addresses while iterating through the DataFrame rows. We keep a list of unique rows, appending only those with unseen emails, thereby manually handling the de-duplication.