Dropping/Filtering rows and columns
Description
Dropping or filtering rows and columns is a fundamental step in data wrangling. It lets you clean and reshape a dataset by removing data you do not need. Typical operations include:
- Dropping rows or columns with missing or irrelevant values
- Filtering rows based on conditions
- Dropping duplicate rows or outliers
This keeps the analysis focused on the most relevant features and can improve the performance of data analysis and machine learning models.
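The three operations listed above can be chained into one small cleaning pass. The sketch below is illustrative only: the `readings` DataFrame and its values are made up, and the 1.5 × IQR rule is one common outlier convention, not the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical sensor readings; the values are invented for illustration.
readings = pd.DataFrame({
    "sensor": ["a", "b", "b", "c", "d", "e"],
    "value": [10.0, 12.0, 12.0, np.nan, 11.0, 95.0],
})

# 1. Drop rows whose "value" is missing.
clean = readings.dropna(subset=["value"])

# 2. Drop exact duplicate rows (the first occurrence is kept).
clean = clean.drop_duplicates()

# 3. Drop outliers: keep values within 1.5 * IQR of the quartiles.
q1, q3 = clean["value"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = clean["value"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
clean = clean[in_range]

print(clean)
```

Each step returns a new DataFrame, so `readings` itself is left untouched.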
Prerequisites
- Basic understanding of Python and Pandas
- Familiarity with DataFrame structure
- Knowledge of Boolean indexing and conditions
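As a quick refresher on Boolean indexing, here is a minimal sketch. It reuses the sample names, ages, and genders from the Examples section; the variable names (`is_30_plus`, `is_female`, and so on) are chosen here for illustration.

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["John", "Anna", "Peter", "Linda", "Tom"],
    "Age": [25, 22, 29, 33, 40],
    "Gender": ["M", "F", "M", "F", "M"],
})

# A comparison on a column yields a Boolean Series, one value per row.
is_30_plus = people["Age"] >= 30
is_female = people["Gender"] == "F"

# Combine masks with & (and), | (or), ~ (not); wrap each comparison in parentheses
# when writing it inline, because these operators bind tighter than comparisons.
both = people[is_30_plus & is_female]
either = people[is_30_plus | is_female]
neither = people[~(is_30_plus | is_female)]

print(both["Name"].tolist())     # ['Linda']
print(either["Name"].tolist())   # ['Anna', 'Linda', 'Tom']
print(neither["Name"].tolist())  # ['John', 'Peter']
```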
Examples
The following example applies the most common drop and filter operations to a small sample DataFrame:
import pandas as pd
# Sample dataset
data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda', 'Tom'],
    'Age': [25, 22, 29, 33, 40],
    'Gender': ['M', 'F', 'M', 'F', 'M'],
    'Salary': [50000, 60000, 45000, 52000, 49000]
}
df = pd.DataFrame(data)
# Drop a column by name
df_dropped_col = df.drop(columns=['Salary'])
# Drop a row by index
df_dropped_row = df.drop(index=2)
# Filter rows where Age is greater than 30
df_filtered = df[df['Age'] > 30]
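# Drop columns in which every value is missing (dropna returns a new DataFrame,
# so the result must be assigned; the sample data has no such column)
df_no_empty_cols = df.dropna(axis=1, how='all')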
# Drop duplicate rows
df_no_duplicates = df.drop_duplicates()
# Drop rows where Salary < 50000
df_salary_filter = df[df['Salary'] >= 50000]
# Display results
print(df_dropped_col)
print(df_filtered)
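A few behaviours of `drop` and filtering are easy to miss. The sketch below uses a cut-down, three-row version of the sample data; the `Bonus` column name is hypothetical and deliberately absent.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["John", "Anna", "Peter"],
    "Salary": [50000, 60000, 45000],
})

# drop() returns a new DataFrame by default; the original keeps its columns.
trimmed = df.drop(columns=["Salary"])
print(list(df.columns))       # ['Name', 'Salary']
print(list(trimmed.columns))  # ['Name']

# Dropping a label that does not exist raises KeyError unless errors="ignore".
safe = df.drop(columns=["Bonus"], errors="ignore")

# Filtering keeps the original index labels; reset_index(drop=True) renumbers them.
high = df[df["Salary"] >= 50000].reset_index(drop=True)
print(high.index.tolist())    # [0, 1]
```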
Real-World Applications
Healthcare
- Remove patients with missing diagnosis data
- Filter rows with critical conditions for urgent analysis
Finance
- Drop transactions with missing merchant IDs
- Filter clients based on credit score thresholds
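As an illustration of the healthcare case, the sketch below drops patients with no recorded diagnosis and then keeps the critical ones. The table, its column names (`patient_id`, `diagnosis`, `severity`), and its values are all hypothetical.

```python
import pandas as pd

# Hypothetical patient records; None marks a missing diagnosis.
patients = pd.DataFrame({
    "patient_id": [101, 102, 103, 104],
    "diagnosis": ["asthma", None, "sepsis", "stroke"],
    "severity": ["stable", "critical", "critical", "critical"],
})

# Drop only the rows where the diagnosis column is missing.
diagnosed = patients.dropna(subset=["diagnosis"])

# Keep the rows flagged as critical for urgent analysis.
urgent = diagnosed[diagnosed["severity"] == "critical"]

print(urgent["patient_id"].tolist())  # [103, 104]
```

Passing `subset` limits the missing-value check to the columns that matter, so a gap in an unrelated column does not discard the row.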
Where Dropping and Filtering Is Applied
Healthcare
- Removing null test results
- Filtering patient records by age or condition
Finance
- Cleaning loan applications by dropping incomplete rows
- Filtering transactions above a certain amount
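The finance items could look like the following sketch. The `loans` table, its columns, and the `MIN_AMOUNT` threshold are assumptions made up for this example.

```python
import pandas as pd

# Hypothetical loan applications; None marks a field the applicant left blank.
loans = pd.DataFrame({
    "application_id": [1, 2, 3, 4],
    "income": [52000.0, None, 61000.0, 48000.0],
    "amount": [12000.0, 9800.0, 25000.0, 4500.0],
})

# Drop every incomplete row (a row is removed if any column is missing).
complete = loans.dropna()

# Keep applications above a chosen amount; the threshold is illustrative.
MIN_AMOUNT = 10000.0
large = complete[complete["amount"] > MIN_AMOUNT]

print(large["application_id"].tolist())  # [1, 3]
```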
Resources
Data Science topic PDF
Harvard Data Science Course
Free online course from Harvard covering data science foundations
Interview Questions
➤ To drop a column, use df.drop(columns=['column_name']).
➤ Filtering keeps the rows or columns that satisfy a condition; dropping removes rows or columns that you name explicitly by label or index.
➤ To remove rows with missing values, use df.dropna().
➤ To filter on several conditions, use Boolean indexing: df[(df['Age'] > 25) & (df['Gender'] == 'F')]
➤ Drop a column when it has too many missing values or offers no useful information (low variance).
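The last answer can be turned into a small routine. This is a sketch under stated assumptions: the 50% missing-value cutoff (`MAX_MISSING`) is an arbitrary choice, "low variance" is simplified to "at most one distinct value", and the data is invented.

```python
import pandas as pd

# Hypothetical data: "country" is constant, "notes" is mostly missing.
df = pd.DataFrame({
    "age": [25, 22, 29, 33, 40],
    "country": ["US", "US", "US", "US", "US"],
    "notes": [None, None, None, "late", "ok"],
})

# Share of missing values per column (isna() gives Booleans, mean() averages them).
MAX_MISSING = 0.5  # illustrative cutoff, tune for your data
missing_share = df.isna().mean()
too_sparse = missing_share[missing_share > MAX_MISSING].index

# Columns with at most one distinct non-missing value carry no information.
constant = [col for col in df.columns if df[col].nunique(dropna=True) <= 1]

# Drop both groups in one call.
reduced = df.drop(columns=too_sparse.union(constant))

print(list(reduced.columns))  # ['age']
```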