encoding-categorical
Description
Most machine learning algorithms operate only on numerical data. Encoding categorical variables is the process of converting string categories (like "Male", "Female" or "Red", "Blue") into numerical form so that ML algorithms can process them.
There are two main techniques:
Label Encoding
Converts each category into an integer (e.g., "Female" → 0, "Male" → 1).
Best for ordinal data (where order matters), because the integers imply a ranking.
One-Hot Encoding
Creates one binary (0/1) column per category.
Best for nominal data (no order).
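For ordinal data, the integer codes should follow the real ranking of the categories. The sketch below is one possible way to do that with scikit-learn's OrdinalEncoder; the "Size" column and its S < M < L order are assumptions made up for illustration, not part of the original material.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal feature: sizes with a natural order S < M < L.
sizes = pd.DataFrame({"Size": ["M", "S", "L", "M"]})

# Pass the order explicitly; by default categories are sorted
# alphabetically (L, M, S), which would not match the real ranking.
encoder = OrdinalEncoder(categories=[["S", "M", "L"]])

# fit_transform returns a 2D float array, so flatten and cast to int.
sizes["Size_Code"] = encoder.fit_transform(sizes[["Size"]]).ravel().astype(int)

print(sizes)
```

With the order stated explicitly, S maps to 0, M to 1, and L to 2, so comparisons between codes reflect the actual ranking.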
Prerequisites
- Understanding of Pandas DataFrames
- Basic ML knowledge (especially feature preprocessing)
- Familiarity with sklearn.preprocessing tools
Examples
The following Python example applies label encoding to the 'Gender' column and one-hot encoding to the 'Color' column:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
# Sample data
df = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Female', 'Male'],
    'Color': ['Red', 'Blue', 'Green', 'Blue']
})
# Label Encoding for 'Gender'
le = LabelEncoder()
# LabelEncoder sorts classes alphabetically, so Female=0 and Male=1
df['Gender_Label'] = le.fit_transform(df['Gender'])
# One-Hot Encoding for 'Color'
# dtype=int gives 0/1 columns (recent pandas versions default to True/False)
df_onehot = pd.get_dummies(df['Color'], prefix='Color', dtype=int)
# Combine with original DataFrame
df_encoded = pd.concat([df, df_onehot], axis=1)
print(df_encoded)
Real-World Applications
Finance: Encoding loan types, account types, or customer regions for fraud/risk models
Healthcare: Encoding patient gender, disease types, or hospital names for predictive diagnosis
E-commerce: Converting product categories, user segments, or payment methods for recommendation models
Where Encoding Is Applied
Finance
- Encoding account type, employment status for credit scoring models
E-commerce
- Product category, user device type, and order type encoding
Marketing
- Encoding gender, device, and ad-type for customer targeting
Resources
Data Science topic PDF
Harvard Data Science Course
Free online course from Harvard covering data science foundations
Interview Questions
➤ Label encoding assigns integers to categories. Use it only when the categories have an inherent order (ordinal data).
➤ Applied to nominal data, label encoding introduces a false sense of order, which can mislead algorithms into treating one category as greater than another.
➤ One-hot encoding creates separate binary columns for each category, avoiding the issue of implied order.
➤ Yes, you must fit the encoder on the training data only, save it, and apply .transform() to the test data so that both sets share the same mapping.
➤ One-hot encoding a feature with many distinct categories produces high-dimensional data (curse of dimensionality). For such high-cardinality features, techniques like target encoding or embeddings are often a better fit.
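As a rough illustration of target encoding, and of fitting on training data before applying the result to test data, the sketch below replaces each category with the mean of the target for that category. It is a simplified version written with plain pandas; the "City" and "Bought" columns and all values are invented for the example.

```python
import pandas as pd

# Hypothetical training data: a categorical feature and a binary target.
train = pd.DataFrame({
    "City": ["A", "A", "B", "B", "B", "C"],
    "Bought": [1, 0, 1, 1, 0, 1],
})
test = pd.DataFrame({"City": ["A", "C", "D"]})  # "D" is unseen in training

# Learn the mapping from the training data only.
city_means = train.groupby("City")["Bought"].mean()
global_mean = train["Bought"].mean()

# Apply the saved mapping; unseen categories fall back to the global mean.
train["City_Encoded"] = train["City"].map(city_means)
test["City_Encoded"] = test["City"].map(city_means).fillna(global_mean)

print(test)
```

The feature stays a single numeric column no matter how many categories exist. Plain per-category means can overfit on rare categories, so smoothing or cross-fitting is commonly added in practice.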