Dropping/Filtering rows and columns

Description

Dropping or filtering rows and columns is a fundamental step in data wrangling. It lets you clean and reshape a dataset by:

  • Removing rows or columns with missing or irrelevant values
  • Filtering rows based on conditions
  • Dropping duplicate rows or outliers

Trimming the data this way keeps the analysis focused on the most relevant features and can improve the performance of data analysis and machine learning models.
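
Outliers are mentioned above but are not a built-in pandas operation, so here is a minimal sketch of one common approach, the 1.5 × IQR rule of thumb. The data, the extra 'Maya' row, and the 1.5 multiplier are assumptions chosen for illustration; other cutoffs or methods may suit your data better.

```python
import pandas as pd

# Hypothetical data: one salary is far outside the typical range
df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda', 'Tom', 'Maya'],
    'Salary': [50000, 60000, 45000, 52000, 49000, 400000],
})

# Interquartile range (IQR) of the Salary column
q1 = df['Salary'].quantile(0.25)
q3 = df['Salary'].quantile(0.75)
iqr = q3 - q1

# Keep rows within 1.5 * IQR of the quartiles (a rule of thumb, not a fixed standard)
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
df_no_outliers = df[df['Salary'].between(lower, upper)]

print(df_no_outliers)
```

With this sample, only the 400000 salary falls outside the bounds, so the other five rows are kept.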

Prerequisites

  • Basic understanding of Python and Pandas
  • Familiarity with DataFrame structure
  • Knowledge of Boolean indexing and conditions

Examples

The example below builds a small DataFrame and applies the most common drop and filter operations in pandas:


import pandas as pd

# Sample dataset
data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda', 'Tom'],
    'Age': [25, 22, 29, 33, 40],
    'Gender': ['M', 'F', 'M', 'F', 'M'],
    'Salary': [50000, 60000, 45000, 52000, 49000]
}
df = pd.DataFrame(data)

# Drop a column by name
df_dropped_col = df.drop(columns=['Salary'])

# Drop a row by index label (here, label 2 is the 'Peter' row)
df_dropped_row = df.drop(index=2)

# Filter rows where Age is greater than 30
df_filtered = df[df['Age'] > 30]

# Drop columns in which every value is missing
# (dropna returns a new DataFrame, so the result must be assigned)
df_no_empty_cols = df.dropna(axis=1, how='all')

# Drop duplicate rows
df_no_duplicates = df.drop_duplicates()

# Keep rows where Salary is at least 50000 (i.e. drop rows where Salary < 50000)
df_salary_filter = df[df['Salary'] >= 50000]

# Display results
print(df_dropped_col)
print(df_filtered)
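
The sample dataset above has no missing values or duplicates, so the dropna and drop_duplicates calls leave it unchanged. The sketch below uses a deliberately messy, hypothetical DataFrame (the df_messy name and the 'Notes' column are made up for illustration) to show what those calls actually remove.

```python
import numpy as np
import pandas as pd

# Hypothetical messy data: a repeated row, missing values, and an empty column
df_messy = pd.DataFrame({
    'Name': ['John', 'Anna', 'Anna', 'Peter', 'Linda'],
    'Age': [25, 22, 22, np.nan, 33],
    'Salary': [50000, 60000, 60000, 45000, np.nan],
    'Notes': [np.nan] * 5,  # every value is missing
})

# Drop columns in which every value is missing (removes 'Notes')
df_cols = df_messy.dropna(axis=1, how='all')

# Drop rows that are missing a value in the listed columns
df_rows = df_cols.dropna(subset=['Age', 'Salary'])

# Drop exact duplicate rows, keeping the first occurrence
df_clean = df_rows.drop_duplicates().reset_index(drop=True)

print(df_clean)
```

Each call returns a new DataFrame, which is why the result is assigned at every step rather than relying on the original being modified.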
          

Real-World Applications

Healthcare

  • Remove patients with missing diagnosis data
  • Filter rows with critical conditions for urgent analysis
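
As a rough sketch of the two healthcare tasks above, assuming a hypothetical patients table (the column names 'diagnosis' and 'condition' and all values are invented for illustration):

```python
import pandas as pd

# Hypothetical patient records; column names and values are illustrative only
patients = pd.DataFrame({
    'patient_id': [101, 102, 103, 104],
    'diagnosis': ['Pneumonia', None, 'Sepsis', 'Fracture'],
    'condition': ['stable', 'critical', 'critical', 'stable'],
})

# Remove patients with no recorded diagnosis
with_diagnosis = patients.dropna(subset=['diagnosis'])

# Keep only critical cases for urgent analysis
critical = with_diagnosis[with_diagnosis['condition'] == 'critical']

print(critical)
```

Patient 102 is critical but has no diagnosis, so it is removed in the first step. Whether that ordering is appropriate depends on the analysis.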

Finance

  • Drop transactions with missing merchant IDs
  • Filter clients based on credit score thresholds
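
The finance tasks above might look like the following sketch. The table layouts, column names, and the 650 cutoff are assumptions for illustration, not an industry standard.

```python
import pandas as pd

# Hypothetical transaction and client tables
transactions = pd.DataFrame({
    'transaction_id': [1, 2, 3],
    'merchant_id': ['M-01', None, 'M-07'],
    'amount': [120.0, 75.5, 980.0],
})
clients = pd.DataFrame({
    'client_id': ['C1', 'C2', 'C3'],
    'credit_score': [580, 710, 690],
})

# Drop transactions that have no merchant ID
valid_transactions = transactions.dropna(subset=['merchant_id'])

# Keep clients at or above a credit score threshold (hypothetical value)
MIN_CREDIT_SCORE = 650
eligible_clients = clients[clients['credit_score'] >= MIN_CREDIT_SCORE]

print(valid_transactions)
print(eligible_clients)
```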

Where Dropping and Filtering Are Applied

Healthcare

  • Removing null test results
  • Filtering patient records by age or condition

Finance

  • Cleaning loan applications by dropping incomplete rows
  • Filtering transactions above a certain amount
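
Filters such as "by age or condition" can be written in several equivalent ways. The sketch below, on a hypothetical records table, shows between for ranges, isin for membership, | for "or", and the query method as a string-based alternative.

```python
import pandas as pd

# Hypothetical patient records; names and values are illustrative only
records = pd.DataFrame({
    'patient_id': [1, 2, 3, 4, 5],
    'age': [34, 67, 72, 15, 58],
    'condition': ['asthma', 'diabetes', 'hypertension', 'asthma', 'diabetes'],
})

# Age within an inclusive range
adults_under_70 = records[records['age'].between(18, 69)]

# Condition is one of several values
chronic = records[records['condition'].isin(['diabetes', 'hypertension'])]

# Age OR condition: wrap each comparison in parentheses and combine with |
seniors_or_asthma = records[(records['age'] >= 65) | (records['condition'] == 'asthma')]

# The same filter expressed as a query string
same_filter = records.query("age >= 65 or condition == 'asthma'")

print(seniors_or_asthma)
```

The parentheses around each comparison matter because | and & bind more tightly than >= and == in Python.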

Resources

Harvard Data Science Course

Free online course from Harvard covering data science foundations

Interview Questions

➤ To drop a column, use df.drop(columns=['column_name']).

➤ Filtering versus dropping: filtering keeps the rows or columns that meet a condition; dropping removes named rows or columns directly.

➤ To remove rows with missing values, use df.dropna().

➤ To filter on multiple conditions, use Boolean indexing with each condition in parentheses: df[(df['Age'] > 25) & (df['Gender'] == 'F')]

➤ Consider dropping a column when it has too many missing values or offers no useful information (low variance).
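
The last point can be sketched as follows. The 50% missing-value threshold and the "single distinct value" test for low variance are assumptions for illustration; suitable cutoffs depend on the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one constant column and one mostly empty column
df = pd.DataFrame({
    'Age': [25, 22, 29, 33, 40],
    'Country': ['US', 'US', 'US', 'US', 'US'],           # constant
    'Bonus': [np.nan, np.nan, np.nan, 1000.0, np.nan],   # 80% missing
    'Salary': [50000, 60000, 45000, 52000, 49000],
})

MAX_MISSING_SHARE = 0.5  # hypothetical threshold

# Share of missing values per column
missing_share = df.isna().mean()
too_sparse = missing_share[missing_share > MAX_MISSING_SHARE].index

# Columns with at most one distinct non-missing value carry no information
constant = [col for col in df.columns if df[col].nunique(dropna=True) <= 1]

to_drop = sorted(set(too_sparse) | set(constant))
df_reduced = df.drop(columns=to_drop)

print(to_drop)
print(df_reduced.columns.tolist())
```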