Value counts, groupby summaries
Description
value_counts() and groupby() are fundamental operations in Pandas used for summarizing categorical and grouped data:
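- value_counts(): Counts the occurrences of each unique value in a Series. By default the result is sorted in descending order of count and excludes missing values.
- groupby(): Splits the data into groups based on one or more columns and applies aggregation functions (such as mean(), sum(), or count()) to summarize each group.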
Both are staples of exploratory data analysis (EDA): they give a quick view of how categories are distributed and how a numeric column varies across groups.
Prerequisites
- Python basics and the pandas library
- Understanding of DataFrames and Series
- Familiarity with aggregation functions (mean(), sum(), etc.)
Examples
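The example below builds a small employee dataset, counts the employees in each department with value_counts(), and summarizes salaries per department with groupby():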
import pandas as pd
# Sample dataset
data = {
    'Department': ['HR', 'HR', 'IT', 'IT', 'IT', 'Sales', 'Sales'],
    'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace'],
    'Salary': [50000, 52000, 60000, 58000, 62000, 45000, 47000]
}
df = pd.DataFrame(data)
# Value counts: Frequency of each department
dept_counts = df['Department'].value_counts()
print("Value Counts:\n", dept_counts)
# Groupby summaries: Average salary by department
salary_summary = df.groupby('Department')['Salary'].mean()
print("\nAverage Salary by Department:\n", salary_summary)
# Groupby with multiple aggregates
grouped_summary = df.groupby('Department')['Salary'].agg(['mean', 'sum', 'max', 'min', 'count'])
print("\nGroupBy with Multiple Aggregations:\n", grouped_summary)
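value_counts() also accepts a few options that are often useful during EDA. The following is a minimal sketch, assuming a small hypothetical Series that contains one missing value: normalize=True returns proportions instead of raw counts, and dropna=False includes missing values in the tally.

```python
import pandas as pd

# Hypothetical data for illustration; None stands in for a missing department
departments = pd.Series(['HR', 'IT', 'IT', 'Sales', None, 'IT'])

# Default behaviour: raw counts, sorted in descending order, missing values excluded
counts = departments.value_counts()

# normalize=True returns each value's share of the non-missing entries
shares = departments.value_counts(normalize=True)

# dropna=False adds an entry that counts the missing values as well
counts_with_missing = departments.value_counts(dropna=False)

print(counts)
print(shares)
print(counts_with_missing)
```

Comparing the totals of counts and counts_with_missing is a quick way to see how many rows have no category at all.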
Real-World Applications
Value counts for transaction types; grouping expenses by category, month, or year
Healthcare: Counting diagnosis types; summarizing average treatment cost per department
E-commerce: Counting orders per product or category; grouping by user to find average order value
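As one illustration of the applications above, grouping expenses by category and month could look like the sketch below. The DataFrame, its column names (date, category, amount), and the values are all hypothetical; the month key is derived from the date column with dt.to_period('M').

```python
import pandas as pd

# Hypothetical expense records; column names and values are assumptions for this sketch
expenses = pd.DataFrame({
    'date': pd.to_datetime(['2024-01-05', '2024-01-20', '2024-02-03',
                            '2024-02-14', '2024-02-28']),
    'category': ['Rent', 'Food', 'Rent', 'Food', 'Food'],
    'amount': [1200.0, 150.0, 1200.0, 90.0, 60.0],
})

# Derive a monthly period from the date so it can serve as a grouping key
expenses['month'] = expenses['date'].dt.to_period('M')

# Total spend for each (month, category) combination
monthly_by_category = expenses.groupby(['month', 'category'])['amount'].sum()

# Number of transactions recorded in each category
category_counts = expenses['category'].value_counts()

print(monthly_by_category)
print(category_counts)
```

The same pattern applies to the other domains listed: swap the grouping columns (for example, department or user) and the aggregated column (treatment cost or order value).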
Where value_counts() and groupby() Are Applied
Finance
- Grouping accounts by type or branch
Retail
- Count of sold items by category
Logistics
- Deliveries grouped by region, value counts by carrier
Resources
Harvard Data Science Course
Free online course from Harvard covering data science foundations
Interview Questions
➤ value_counts() returns a Series containing the counts of unique values in a Series.
➤ Use groupby() when you need to summarize or aggregate data based on categories or groups.
➤ Yes, multiple aggregations can be applied at once by passing a list of functions such as ['mean', 'sum', 'count'] to .agg().
➤ groupby() is more programmatic and flexible, while pivot_table() is table-oriented and used for reshaping.
➤ When grouping by multiple columns, a hierarchical index (MultiIndex) is created, and the aggregation is applied per combination of those columns.
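The last two answers can be made concrete with a short sketch. It assumes a hypothetical dataset with an extra Level column (not part of the example earlier on this page) and compares a two-column groupby() with the equivalent pivot_table() call.

```python
import pandas as pd

# Hypothetical data; the 'Level' column is an assumption added for illustration
df = pd.DataFrame({
    'Department': ['HR', 'HR', 'IT', 'IT', 'IT', 'Sales'],
    'Level': ['Junior', 'Senior', 'Junior', 'Junior', 'Senior', 'Junior'],
    'Salary': [50000, 52000, 60000, 58000, 62000, 45000],
})

# Grouping by two columns yields a Series indexed by a MultiIndex (Department, Level)
by_dept_level = df.groupby(['Department', 'Level'])['Salary'].mean()

# pivot_table computes the same means but lays them out as a Department x Level table
pivoted = df.pivot_table(index='Department', columns='Level',
                         values='Salary', aggfunc='mean')

# Unstacking the inner index level reshapes the groupby result into a similar layout
unstacked = by_dept_level.unstack('Level')

print(by_dept_level)
print(pivoted)
```

Combinations that do not occur in the data (here, a senior employee in Sales) are simply absent from the groupby() result, while they show up as NaN cells in the pivoted table.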