Encoding Categorical Variables

Description

Most machine learning algorithms require numerical input. Encoding categorical variables is the process of converting string categories (like "Male", "Female" or "Red", "Blue") into numerical form so that ML algorithms can process them. There are two main techniques:

Label Encoding

Converts each category into an integer (e.g., "Female" → 0, "Male" → 1).
Best for ordinal data (where order matters), provided the integers follow that order.

One-Hot Encoding

Creates one binary (0/1) column per category.
Best for nominal data (no order).
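
For ordinal data, the integer codes are only useful if they follow the real order of the categories. The sketch below is one way to do that with scikit-learn's OrdinalEncoder; the Size column and its Small < Medium < Large ordering are hypothetical and not part of the original example.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal feature (assumed for illustration only)
df_sizes = pd.DataFrame({'Size': ['Medium', 'Small', 'Large', 'Small']})

# List the categories in their real order so the integers reflect it
size_order = ['Small', 'Medium', 'Large']
encoder = OrdinalEncoder(categories=[size_order])

# OrdinalEncoder expects a 2D input and returns a 2D float array
encoded = encoder.fit_transform(df_sizes[['Size']])
df_sizes['Size_Encoded'] = encoded.ravel().astype(int)

print(df_sizes)
```

Without the explicit categories argument, the encoder would order the categories alphabetically (Large, Medium, Small), which does not match the intended ranking.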

Prerequisites

Understanding of Pandas DataFrames
Basic ML knowledge (especially feature preprocessing)
Familiarity with sklearn.preprocessing tools

Examples

The following Python example label-encodes the 'Gender' column and one-hot encodes the 'Color' column:


import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Sample data
df = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Female', 'Male'],
    'Color': ['Red', 'Blue', 'Green', 'Blue']
})

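# Label Encoding for 'Gender'
# LabelEncoder sorts the classes alphabetically: Female=0, Male=1
le = LabelEncoder()
df['Gender_Label'] = le.fit_transform(df['Gender'])

# One-Hot Encoding for 'Color'
# dtype=int gives 0/1 columns (recent pandas versions default to True/False)
df_onehot = pd.get_dummies(df['Color'], prefix='Color', dtype=int)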

# Combine with original DataFrame
df_encoded = pd.concat([df, df_onehot], axis=1)

print(df_encoded)

Real-World Applications

Finance: Encoding loan types, account types, or customer regions for fraud/risk models

Healthcare: Encoding patient gender, disease types, or hospital names for predictive diagnosis

E-commerce: Converting product categories, user segments, or payment methods for recommendation models

Where Categorical Encoding Is Applied

Finance

Encoding account type and employment status for credit scoring models

E-commerce

Encoding product category, user device type, and order type

Marketing

Encoding gender, device, and ad-type for customer targeting

Resources

Harvard Data Science Course

Free online course from Harvard covering data science foundations

Interview Questions

➤ Label encoding assigns an integer to each category. Use it only when the categories have an inherent order (ordinal data).

➤ Applied to nominal data, label encoding introduces a false sense of order, which can mislead algorithms into treating one category as greater than another.

➤ One-hot encoding creates a separate binary column for each category, avoiding the issue of implied order.

➤ The same encoding must be used for training and test data: fit the encoder on the training data, save it, and call .transform() on the test data to maintain consistency.

➤ With high-cardinality features, one-hot encoding can lead to high-dimensional data (curse of dimensionality). In such cases, techniques like target encoding or embeddings are often a better fit.
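
As a rough illustration of the target encoding mentioned above, the sketch below replaces each category with the mean of the target for that category. The City and Purchased columns are hypothetical, and this is the plain, unsmoothed variant; practical implementations usually add smoothing or cross-fitting to limit target leakage.

```python
import pandas as pd

# Hypothetical training data with a binary target (assumed for illustration)
train = pd.DataFrame({
    'City': ['A', 'B', 'A', 'C', 'B', 'A'],
    'Purchased': [1, 0, 1, 0, 1, 0],
})

# Mean target per category, computed on the training data only
city_means = train.groupby('City')['Purchased'].mean()
global_mean = train['Purchased'].mean()

# Map each category to its mean; unseen categories get the global mean
test = pd.DataFrame({'City': ['A', 'C', 'D']})
test['City_Target'] = test['City'].map(city_means).fillna(global_mean)

print(test)
```

The feature stays a single column no matter how many distinct cities there are, which is the main appeal compared with one-hot encoding.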