Train-Test Split

Reading Time: 12 min


Description

In machine learning, we divide the dataset into training and testing sets to evaluate how well a model performs on unseen data. The training set is used to fit the model, and the testing set is used to assess its performance.
Evaluating on held-out data helps detect overfitting and gives an honest estimate of how well the model generalizes.

Prerequisites

  • Pandas and NumPy
  • scikit-learn (sklearn) installed
  • Understanding of machine learning workflow

Examples

Here's a simple example of performing a train-test split with scikit-learn:


from sklearn.model_selection import train_test_split
import pandas as pd

# Sample data
data = pd.DataFrame({
    'Age': [22, 25, 47, 52, 46, 56, 44, 34],
    'Salary': [20000, 30000, 80000, 110000, 95000, 120000, 70000, 62000],
    'Purchased': [0, 0, 1, 1, 1, 1, 0, 0]
})

# Splitting into features (X) and target (y)
X = data[['Age', 'Salary']]
y = data['Purchased']

# Perform train-test split (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Output the split
print("Training Features:\n", X_train)
print("\nTesting Features:\n", X_test)


🧠 random_state ensures reproducibility.
📏 test_size=0.2 means 20% of the data goes to testing, 80% to training.
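For classification problems like the one above, a plain random split can leave the training and test sets with different class proportions. A sketch of how the same split could be made stratified, using the `stratify` parameter of `train_test_split` (the dataset below mirrors the example above; `test_size=0.25` is chosen so each class lands in the test set):

```python
from sklearn.model_selection import train_test_split
import pandas as pd

# Same sample data as the example above
data = pd.DataFrame({
    'Age': [22, 25, 47, 52, 46, 56, 44, 34],
    'Salary': [20000, 30000, 80000, 110000, 95000, 120000, 70000, 62000],
    'Purchased': [0, 0, 1, 1, 1, 1, 0, 0]
})

X = data[['Age', 'Salary']]
y = data['Purchased']

# stratify=y keeps the 50/50 class balance in both the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

print(y_train.value_counts())  # 3 of each class
print(y_test.value_counts())   # 1 of each class
```

Without `stratify`, a small test set can easily end up containing only one class, which makes accuracy scores misleading.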

Real-World Applications

  • Finance: predicting credit scores or loan defaults
  • Healthcare: patient disease prediction
  • Marketing: customer churn or conversion prediction

Where Train-Test Split Is Applied

  • All supervised learning pipelines
  • Model validation and benchmarking
  • Cross-validation techniques
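Cross-validation extends the train-test idea by rotating which portion of the data is held out, so every sample is used for both training and evaluation. A minimal sketch using scikit-learn's `cross_val_score` with a `KFold` splitter (the model choice, `LogisticRegression`, is just an illustrative assumption; the data mirrors the earlier example):

```python
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LogisticRegression
import pandas as pd

data = pd.DataFrame({
    'Age': [22, 25, 47, 52, 46, 56, 44, 34],
    'Salary': [20000, 30000, 80000, 110000, 95000, 120000, 70000, 62000],
    'Purchased': [0, 0, 1, 1, 1, 1, 0, 0]
})
X = data[['Age', 'Salary']]
y = data['Purchased']

model = LogisticRegression(max_iter=1000)

# 4 folds: each fold serves as the test set once, the other 3 as training data
cv = KFold(n_splits=4, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv)

print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```

Averaging across folds gives a more stable performance estimate than a single train-test split, which matters most on small datasets like this one.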

Resources

  • Data Science Train-Test Split PDF
  • Harvard Data Science Course: free online course from Harvard covering data science foundations

Interview Questions

Why do we perform a train-test split?
➤ To evaluate the model's performance on unseen data and detect overfitting.

What split ratios are commonly used?
➤ Common ratios include 80/20, 70/30, or 75/25; the choice depends on the dataset size.

What does random_state do?
➤ It sets the seed for random shuffling, ensuring reproducibility.

What happens if we don't split the data?
➤ The model may perform well on training data but poorly on new data (overfitting).

Can the test set be used for hyperparameter tuning?
➤ Yes, but it's better to use cross-validation or a separate validation set for tuning.
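The separate validation set mentioned in the last answer can be produced by calling `train_test_split` twice: once to carve off the test set, and again to split the remainder into train and validation. A small sketch with synthetic data (the 60/20/20 ratio is an illustrative assumption):

```python
from sklearn.model_selection import train_test_split
import numpy as np

# Synthetic data: 10 samples, 2 features each
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# First split: hold out 20% as the final test set
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Second split: 25% of the remaining 80% = 20% of the total, for validation
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # 6 2 2
```

The validation set is used for tuning decisions; the test set is touched only once, for the final performance report.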