
This approach uses Pandas' built-in `drop_duplicates` method, which is designed to handle duplicates and provides a clean and efficient solution.
Time Complexity: O(n), where n is the number of rows in the DataFrame. The operation scans the DataFrame once to identify duplicates.
Space Complexity: effectively O(1) additional space for the result when the DataFrame is modified in place via `inplace=True`, although Pandas still allocates some auxiliary storage internally to track which keys have been seen.

import pandas as pd

def drop_duplicate_rows(df):
    # Keep the first occurrence of each email and drop later duplicates,
    # modifying the DataFrame in place as the complexity note above assumes.
    df.drop_duplicates(subset='email', keep='first', inplace=True)
    return df

This Python solution uses the `drop_duplicates` method from the Pandas library, which removes duplicate rows from a DataFrame based on the columns you specify. Here, `subset='email'` makes the email address the basis for detecting duplicates, and `keep='first'` retains the first occurrence wherever duplicates are found.
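For reference, here is a minimal sketch of how the `keep` parameter changes the result; the three-row `data` dict below is illustrative and not part of the problem's dataset:

import pandas as pd

# Illustrative sample, not the problem's data.
data = {
    'customer_id': [1, 2, 3],
    'email': ['a@example.com', 'b@example.com', 'a@example.com']
}
df = pd.DataFrame(data)

# keep='first' (the default) retains rows 0 and 1.
print(df.drop_duplicates(subset='email', keep='first'))

# keep='last' retains rows 1 and 2 instead; keep=False would drop
# every row whose email appears more than once.
print(df.drop_duplicates(subset='email', keep='last'))
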
In this approach, we check for duplicates manually, using a set to track the entries we have already seen, which lets us handle duplicates without relying on a de-duplication-specific library function.
Time Complexity: O(n), where n is the number of entries in the DataFrame. Each entry is processed exactly once.
Space Complexity: O(n), due to the storage needed for the list of unique rows and the set of seen emails.

import pandas as pd

def drop_duplicate_rows_custom(df):
    seen = set()        # emails encountered so far
    unique_rows = []    # rows whose email had not been seen before
    for _, row in df.iterrows():
        if row['email'] not in seen:
            seen.add(row['email'])
            unique_rows.append(row)
    # Rebuild a DataFrame from the surviving rows.
    return pd.DataFrame(unique_rows)

# Example usage
if __name__ == '__main__':
    data = {
        'customer_id': [1, 2, 3, 4, 5, 6],
        'name': ['Ella', 'David', 'Zachary', 'Alice', 'Finn', 'Violet'],
        'email': ['emily@example.com', 'michael@example.com', 'sarah@example.com', 'john@example.com', 'john@example.com', 'alice@example.com']
    }
    df = pd.DataFrame(data)
    result = drop_duplicate_rows_custom(df)
    print(result)

This Python solution uses a set to track seen email addresses while iterating through the DataFrame rows with `iterrows`. We keep a list of unique rows, appending only those whose email has not been seen before, thereby handling the de-duplication manually.
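
As an aside, the same first-occurrence behaviour can be reproduced without a Python-level loop by masking with `Series.duplicated`, which returns True for every occurrence of a value after its first. The sketch below is an alternative shown for comparison, not part of the original solution, and `drop_duplicate_rows_mask` is a name chosen here for illustration:

import pandas as pd

def drop_duplicate_rows_mask(df):
    # duplicated(keep='first') is False for the first occurrence of each
    # email and True for every later one, so negating it keeps the firsts.
    return df[~df['email'].duplicated(keep='first')]

On the sample data above this returns the same five rows as `drop_duplicate_rows_custom`, while letting Pandas perform the scan in vectorized code rather than through `iterrows`.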