Train-Test Split

Reading Time: 12 min


Description

In machine learning, we divide the dataset into training and testing sets to evaluate how well a model performs on unseen data. The training set is used to fit the model, and the testing set is used to assess its performance.
Evaluating on held-out data helps detect overfitting and gives an honest estimate of how well the model generalizes.

Prerequisites

  • Pandas and NumPy
  • scikit-learn (sklearn) installed
  • Understanding of machine learning workflow

Examples

Here's a simple example of performing a train-test split with scikit-learn:


from sklearn.model_selection import train_test_split
import pandas as pd

# Sample data
data = pd.DataFrame({
    'Age': [22, 25, 47, 52, 46, 56, 44, 34],
    'Salary': [20000, 30000, 80000, 110000, 95000, 120000, 70000, 62000],
    'Purchased': [0, 0, 1, 1, 1, 1, 0, 0]
})

# Splitting into features (X) and target (y)
X = data[['Age', 'Salary']]
y = data['Purchased']

# Perform train-test split (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Output the split
print("Training Features:\n", X_train)
print("\nTesting Features:\n", X_test)


🧠 random_state ensures reproducibility.
📏 test_size=0.2 means 20% of the data goes to testing, 80% to training.
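For classification problems like the one above, a plain random split can leave the training and test sets with different class proportions. A sketch of how the same split could be made stratified, using the `stratify` parameter of `train_test_split` (the dataset below mirrors the example above; `test_size=0.25` is chosen so each class lands in the test set):

```python
from sklearn.model_selection import train_test_split
import pandas as pd

# Same sample data as the example above
data = pd.DataFrame({
    'Age': [22, 25, 47, 52, 46, 56, 44, 34],
    'Salary': [20000, 30000, 80000, 110000, 95000, 120000, 70000, 62000],
    'Purchased': [0, 0, 1, 1, 1, 1, 0, 0]
})

X = data[['Age', 'Salary']]
y = data['Purchased']

# stratify=y keeps the 50/50 class balance in both the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

print(y_train.value_counts())  # 3 of each class
print(y_test.value_counts())   # 1 of each class
```

Without `stratify`, a small test set can easily end up containing only one class, which makes accuracy scores misleading.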

Real-World Applications

  • Finance: predicting credit scores or loan defaults
  • Healthcare: patient disease prediction
  • Marketing: customer churn or conversion prediction

Where Train-Test Split Is Applied

  • All supervised learning pipelines
  • Model validation and benchmarking
  • Cross-validation techniques
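Cross-validation extends the train-test idea by rotating which portion of the data is held out, so every sample is used for both training and evaluation. A minimal sketch using scikit-learn's `cross_val_score` with a `KFold` splitter (the model choice, `LogisticRegression`, is just an illustrative assumption; the data mirrors the earlier example):

```python
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LogisticRegression
import pandas as pd

data = pd.DataFrame({
    'Age': [22, 25, 47, 52, 46, 56, 44, 34],
    'Salary': [20000, 30000, 80000, 110000, 95000, 120000, 70000, 62000],
    'Purchased': [0, 0, 1, 1, 1, 1, 0, 0]
})
X = data[['Age', 'Salary']]
y = data['Purchased']

model = LogisticRegression(max_iter=1000)

# 4 folds: each fold serves as the test set once, the other 3 as training data
cv = KFold(n_splits=4, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv)

print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```

Averaging across folds gives a more stable performance estimate than a single train-test split, which matters most on small datasets like this one.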

Resources

  • Data Science Train-Test Split PDF
  • Harvard Data Science Course: free online course from Harvard covering data science foundations

Interview Questions

Why do we perform a train-test split?
➤ To evaluate the model's performance on unseen data and detect overfitting.

What split ratios are commonly used?
➤ Common ratios include 80/20, 70/30, or 75/25; the choice depends on the dataset size.

What does random_state do?
➤ It sets the seed for random shuffling, ensuring reproducibility.

What happens if we don't split the data?
➤ The model may perform well on training data but poorly on new data (overfitting).

Can the test set be used for hyperparameter tuning?
➤ Yes, but it's better to use cross-validation or a separate validation set for tuning.
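The separate validation set mentioned in the last answer can be produced by calling `train_test_split` twice: once to carve off the test set, and again to split the remainder into train and validation. A small sketch with synthetic data (the 60/20/20 ratio is an illustrative assumption):

```python
from sklearn.model_selection import train_test_split
import numpy as np

# Synthetic data: 10 samples, 2 features each
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# First split: hold out 20% as the final test set
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Second split: 25% of the remaining 80% = 20% of the total, for validation
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # 6 2 2
```

The validation set is used for tuning decisions; the test set is touched only once, for the final performance report.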