Pandas
Description
Pandas is a fast, powerful, and flexible open-source data analysis and manipulation library for Python. It provides two primary data structures:
- Series: a one-dimensional labeled array.
- DataFrame: a two-dimensional labeled, tabular structure.
Pandas makes it easy to perform indexing, filtering, grouping, merging, and cleaning on structured data.
Prerequisites
- Understanding of basic Python
- Familiarity with NumPy
- Concept of rows and columns in tables
Examples
Here's a simple example that creates a Series and a DataFrame, then indexes, filters, and groups the data:
import pandas as pd
# Creating a Series
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print("Series:\n", s)
# Creating a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]}
df = pd.DataFrame(data)
print("DataFrame:\n", df)
# Indexing
print("First row:\n", df.loc[0])
print("Name column:\n", df['Name'])
# Filtering
print("Age > 25:\n", df[df['Age'] > 25])
# Grouping
grouped = df.groupby('Age')
for age, group in grouped:
    print(f"\nGroup: Age = {age}\n", group)
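The example above covers indexing, filtering, and grouping. Merging and cleaning, which the description also mentions, could be sketched as follows. The `people` and `cities` tables are invented for illustration, and filling the missing age with the median is just one possible cleaning choice.

```python
import pandas as pd

# Hypothetical tables, invented for illustration
people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Bob"],
    "Age": [25, 30, None, 30],
})
cities = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "City": ["Paris", "Berlin", "Madrid"],
})

# Cleaning: drop exact duplicate rows, then fill the missing age with the median
clean = people.drop_duplicates().copy()
clean["Age"] = clean["Age"].fillna(clean["Age"].median())

# Merging: a left join keeps every row of `clean` and adds the matching City
merged = clean.merge(cities, on="Name", how="left")
print(merged)
```

A left join is used here so that a person without a matching city would still appear in the result, with a missing City value.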
Real-World Applications
Data Analysis
- Load, clean, and analyze structured data from CSV files, Excel workbooks, and databases
- Compute descriptive statistics, trends, and summaries
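A minimal sketch of that load, clean, and summarize workflow is shown below. The CSV content is made up and read from an in-memory string so the snippet is self-contained; in practice `pd.read_csv` would usually receive a file path.

```python
import io
import pandas as pd

# Hypothetical CSV content; a real script would pass a file path to read_csv
csv_text = """city,temperature
Oslo,4.5
Cairo,
Lima,19.0
Oslo,6.5
"""

df = pd.read_csv(io.StringIO(csv_text))

# Clean: drop rows with a missing temperature
df = df.dropna(subset=["temperature"])

# Analyze: descriptive statistics and a per-city mean
summary = df["temperature"].describe()
mean_by_city = df.groupby("city")["temperature"].mean()
print(summary)
print(mean_by_city)
```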
Image & Signal Processing
- Representing pixel data as arrays
- Applying filters via convolution
Finance
- Time-series stock data analysis
- Portfolio performance calculations
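As a rough illustration of both points, the sketch below computes daily returns from a small price table and combines them into a portfolio return. The tickers, prices, and equal weights are all hypothetical, and the calculation assumes the portfolio is rebalanced to the same weights every day.

```python
import pandas as pd

# Hypothetical closing prices for two invented tickers
prices = pd.DataFrame(
    {"AAA": [100.0, 110.0, 99.0], "BBB": [50.0, 50.0, 55.0]},
    index=pd.date_range("2024-01-01", periods=3, freq="D"),
)

# Daily simple returns; the first row is dropped because it has no previous day
returns = prices.pct_change().dropna()

# Assumed fixed portfolio weights (they should sum to 1)
weights = pd.Series({"AAA": 0.5, "BBB": 0.5})

# Weighted daily portfolio return, then cumulative growth of 1 unit invested
portfolio_returns = returns.dot(weights)
cumulative = (1 + portfolio_returns).cumprod()
print(cumulative)
```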
Where Pandas Is Applied
Healthcare
- Analyzing patient records and lab test results
- Grouping patients by age, disease, or treatment plans
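Grouping patients by age could look like the sketch below, which bins ages with `pd.cut` and averages a lab value per band. The patient records, the `glucose` column, and the bin edges are invented for illustration.

```python
import pandas as pd

# Hypothetical patient records, invented for illustration
patients = pd.DataFrame({
    "patient_id": [1, 2, 3, 4, 5],
    "age": [8, 34, 52, 71, 45],
    "glucose": [90, 105, 130, 150, 110],
})

# Bin ages into labeled bands; the bin edges here are an arbitrary choice
patients["age_group"] = pd.cut(
    patients["age"],
    bins=[0, 17, 64, 120],
    labels=["child", "adult", "senior"],
)

# Average lab value per age band
glucose_by_group = patients.groupby("age_group", observed=True)["glucose"].mean()
print(glucose_by_group)
```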
E-commerce
- User purchase behavior
- Grouping orders by product/category
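One way to group orders by category is sketched here: revenue is computed per order line and then summed per category. The order data and column names are hypothetical.

```python
import pandas as pd

# Hypothetical order lines, invented for illustration
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5],
    "category": ["books", "toys", "books", "games", "toys"],
    "quantity": [2, 1, 1, 3, 2],
    "unit_price": [10.0, 15.0, 20.0, 30.0, 5.0],
})

# Revenue per order line
orders["revenue"] = orders["quantity"] * orders["unit_price"]

# Total revenue per category, largest first
revenue_by_category = (
    orders.groupby("category")["revenue"].sum().sort_values(ascending=False)
)
print(revenue_by_category)
```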
Machine Learning
- Underlying numerical operations in models like linear regression, PCA
- Data preprocessing and augmentation
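A typical preprocessing step might resemble the following sketch, which imputes a missing numeric value and one-hot encodes a categorical column. The `raw` table is made up, and mean imputation is only one of several reasonable strategies.

```python
import pandas as pd

# Hypothetical raw feature table, invented for illustration
raw = pd.DataFrame({
    "age": [25.0, None, 35.0],
    "plan": ["basic", "pro", "basic"],
})

features = raw.copy()

# Impute the missing age with the column mean (an assumed strategy)
features["age"] = features["age"].fillna(features["age"].mean())

# One-hot encode the categorical column into 0/1 integer columns
features = pd.get_dummies(features, columns=["plan"], dtype=int)
print(features)
```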
Robotics
- Coordinate transformations and movement control using arrays
- Sensor data processing with broadcasting
Resources
Data Science topic PDF
Harvard Data Science Course
Free online course from Harvard covering data science foundations
Interview Questions
What is Pandas?
➤ Pandas is a Python library that provides data structures such as Series and DataFrame for efficient manipulation and analysis of structured data.
What is the difference between a Series and a DataFrame?
➤ A Series is a one-dimensional labeled array, whereas a DataFrame is a two-dimensional labeled table with rows and columns.
How do you filter rows in a DataFrame?
➤ Use boolean indexing: df[df['Age'] > 30] returns the rows where Age > 30.
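To make the mechanism explicit, the sketch below builds the boolean mask separately and then combines two conditions. It reuses the Name/Age data from the Examples section; the second filter is an invented illustration.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie"], "Age": [25, 30, 35]})

# The comparison yields a boolean Series: False, False, True
mask = df["Age"] > 30
older = df[mask]

# Combine conditions with & (and) or | (or); each comparison needs parentheses
in_range = df[(df["Age"] >= 30) & (df["Name"] != "Charlie")]
print(older)
print(in_range)
```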
What does groupby() do?
➤ groupby() splits the data into groups based on the values of one or more columns, then applies a function such as sum, mean, or count to each group.
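A short sketch of this split-and-apply pattern, using named aggregation to compute several statistics at once. The `Team` and `Score` data are hypothetical.

```python
import pandas as pd

# Hypothetical scores, invented for illustration
df = pd.DataFrame({
    "Team": ["A", "A", "B", "B", "B"],
    "Score": [10, 20, 5, 15, 25],
})

# Split by Team, then apply sum, mean, and count to each group's Score
stats = df.groupby("Team")["Score"].agg(total="sum", average="mean", n="count")
print(stats)
```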
What is the difference between .loc[] and .iloc[]?
➤ .loc[] selects by index label, while .iloc[] selects by integer position.
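The difference is easiest to see when the index labels are not 0, 1, 2. The sketch below assumes an arbitrary index of 10, 20, 30 chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie"], "Age": [25, 30, 35]},
    index=[10, 20, 30],  # arbitrary labels, chosen for illustration
)

by_label = df.loc[10]        # row whose index label is 10
by_position = df.iloc[0]     # first row by position (the same row here)

label_slice = df.loc[10:20]  # label slices include both endpoints: 2 rows
pos_slice = df.iloc[0:1]     # position slices exclude the end: 1 row
print(label_slice)
print(pos_slice)
```

With this index, `df.loc[0]` would raise a `KeyError` because no row carries the label 0, while `df.iloc[0]` still returns the first row.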