DataFrame students
+-------------+--------+
| Column Name | Type   |
+-------------+--------+
| student_id  | int    |
| name        | object |
| age         | int    |
+-------------+--------+
There are some rows having missing values in the name column.
Write a solution to remove the rows with missing values.
The result format is in the following example.
Example 1:
Input:
+------------+---------+-----+
| student_id | name    | age |
+------------+---------+-----+
| 32         | Piper   | 5   |
| 217        | None    | 19  |
| 779        | Georgia | 20  |
| 849        | Willow  | 14  |
+------------+---------+-----+
Output:
+------------+---------+-----+
| student_id | name    | age |
+------------+---------+-----+
| 32         | Piper   | 5   |
| 779        | Georgia | 20  |
| 849        | Willow  | 14  |
+------------+---------+-----+
Explanation: The student with id 217 has an empty value in the name column, so that row is removed.
The key idea in #2883 Drop Missing Data is to clean a dataset by removing rows that contain missing values in a specific column. In many data processing tasks, missing or NULL/NaN entries can affect analysis, so filtering them out is a common preprocessing step.
A straightforward approach is to scan the dataset and keep only the rows where the target column contains a valid value. In data-processing libraries such as Pandas, this can be done using functions like dropna() or by applying a boolean filter that checks whether the column value is not null.
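As a minimal sketch of the boolean-filter variant mentioned above, the snippet below rebuilds the Example 1 input and keeps only the rows whose name is present. The variable names `mask` and `filtered` are illustrative and not part of the problem statement.

```python
import pandas as pd

# Example 1 input; None marks the missing name.
students = pd.DataFrame({
    "student_id": [32, 217, 779, 849],
    "name": ["Piper", None, "Georgia", "Willow"],
    "age": [5, 19, 20, 14],
})

# notna() returns a boolean Series: True where 'name' holds a value.
mask = students["name"].notna()

# Indexing with the mask keeps the True rows and returns a new DataFrame;
# the original 'students' frame is left unchanged.
filtered = students[mask]
```

The mask form is handy when the null check has to be combined with other conditions, for example `students[mask & (students["age"] >= 18)]`.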
The algorithm processes each row once and decides whether it should remain in the dataset. Because it performs a single pass over the rows, the method is efficient and easy to implement. The time complexity is O(n), where n is the number of rows. The auxiliary space is O(1) when rows are removed in place; note that in Pandas, dropna() and boolean filtering return a new DataFrame by default, so the result itself can occupy up to O(n).
| Approach | Time Complexity | Space Complexity |
|---|---|---|
| Filter rows with non-missing values (e.g., using dropna or boolean mask) | O(n) | O(1) |
Use these hints if you're stuck. Try solving on your own first.
Consider using a built-in function in the pandas library to remove rows that have missing values in a specified column.
This approach leverages a filtering method to iterate over each row and eliminate the rows with missing 'name' values. The method checks for nullity in the 'name' column and keeps only those rows where 'name' is not null.
Time Complexity: O(n), where n is the number of rows, as we potentially check each row.
Space Complexity: O(1) auxiliary space if the rows are dropped in place (inplace=True). By default, dropna() returns a new DataFrame, so the result itself can take up to O(n).
import pandas as pd

def drop_missing_data(students):
    return students.dropna(subset=['name'])

In Python, Pandas provides a convenient dropna function which is used to remove missing values. Here, we specify the 'name' column in the subset parameter to ensure only rows where the 'name' is missing are dropped.
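To illustrate what the subset parameter changes, the sketch below runs the same function on a hypothetical variation of Example 1 in which one age is also missing. That extra gap is an assumption added for illustration and does not appear in the problem data.

```python
import pandas as pd

def drop_missing_data(students: pd.DataFrame) -> pd.DataFrame:
    # Only the 'name' column is inspected for missing values.
    return students.dropna(subset=["name"])

# Hypothetical data: Example 1 plus a missing age for student 779.
students = pd.DataFrame({
    "student_id": [32, 217, 779, 849],
    "name": ["Piper", None, "Georgia", "Willow"],
    "age": [5, 19, None, 14],
})

# With subset=['name'], student 779 is kept despite the missing age.
by_name = drop_missing_data(students)

# Without subset, any missing value in any column removes the row.
by_any = students.dropna()
```

Here `by_name` keeps students 32, 779, and 849, while `by_any` keeps only 32 and 849. In both cases `students` itself still has four rows, because dropna returns a new DataFrame unless inplace=True is passed.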
This approach manually iterates over each row in the dataset, checking if the 'name' field is missing. Rows with missing 'name' values are filtered out, which can be useful in environments that lack high-level filtering functions.
Time Complexity: O(n), to iterate over each student.
Space Complexity: O(n), to store the valid students in the new array.
#include <stdio.h>
#include <stdlib.h>
This exact problem is uncommon in FAANG interviews, but the underlying skill of handling missing or invalid data comes up frequently in data engineering and data science interviews, so dataset cleaning techniques are worth knowing.
A tabular data structure such as a DataFrame is ideal for this problem because it allows efficient row filtering and column-based operations. DataFrames provide built-in methods for detecting and removing missing values quickly.
The optimal approach is to filter out rows where the required column contains null or missing values. In data processing tools like Pandas, this is typically done using functions such as dropna() or a boolean mask that checks for non-null entries.
Missing data can lead to inaccurate analysis, errors in processing, or misleading results. Removing or handling null values ensures that algorithms and analytics pipelines operate on clean and reliable data.
This C program defines a structure Student and iterates over each student, checking if the name is NULL. Valid entries are copied to a new array, which is returned after the loop.
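The same copy-the-valid-entries loop can be sketched in plain Python for readers who want to see the manual iteration without a DataFrame library. This is not a translation of the C program; the `Student` type and the `drop_missing_names` function are hypothetical names chosen for this illustration.

```python
from typing import List, Optional, TypedDict

class Student(TypedDict):
    student_id: int
    name: Optional[str]  # None models a missing name
    age: int

def drop_missing_names(students: List[Student]) -> List[Student]:
    """Return a new list holding only students whose name is present."""
    valid: List[Student] = []
    for student in students:          # single pass: O(n) time
        if student["name"] is not None:
            valid.append(student)     # new list: up to O(n) extra space
    return valid

roster: List[Student] = [
    {"student_id": 32, "name": "Piper", "age": 5},
    {"student_id": 217, "name": None, "age": 19},
    {"student_id": 779, "name": "Georgia", "age": 20},
    {"student_id": 849, "name": "Willow", "age": 14},
]

kept = drop_missing_names(roster)
```

After the call, `kept` holds the entries for students 32, 779, and 849, and `roster` is unchanged, which matches the O(n) space bound stated for this approach.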