train-test-split
Description
In machine learning, we divide the dataset into training and testing sets to evaluate how well a model performs on unseen data. The training set is used to build the model, and the testing set is used to assess its performance.
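Comparing performance on the two sets exposes overfitting and shows whether the model generalizes to new data. The split itself does not prevent overfitting; it makes it visible.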
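As a rough illustration of that workflow, the sketch below fits a model on the training set and scores it on both sets. The synthetic dataset from `make_classification` and the choice of `LogisticRegression` are assumptions made for this example only; any estimator with `fit` and `score` would do.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data, used here only for illustration (not from a real dataset)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Build the model on the training set only
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Assess it on both sets; a large gap suggests overfitting
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"Train accuracy: {train_acc:.2f}")
print(f"Test accuracy:  {test_acc:.2f}")
```

The test accuracy is the number to report, since the model never saw those rows during training.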
Prerequisites
- Pandas and NumPy
- scikit-learn (sklearn) installed
- Basic understanding of the machine learning workflow
Examples
Here's a simple example that splits a small dataset into training and testing sets with scikit-learn's train_test_split:
from sklearn.model_selection import train_test_split
import pandas as pd
# Sample data
data = pd.DataFrame({
'Age': [22, 25, 47, 52, 46, 56, 44, 34],
'Salary': [20000, 30000, 80000, 110000, 95000, 120000, 70000, 62000],
'Purchased': [0, 0, 1, 1, 1, 1, 0, 0]
})
# Splitting into features (X) and target (y)
X = data[['Age', 'Salary']]
y = data['Purchased']
# Perform train-test split (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Output the split
print("Training Features:\n", X_train)
print("\nTesting Features:\n", X_test)
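🧠 random_state fixes the seed used to shuffle the rows, so the same split is produced on every run.
📏 test_size=0.2 sends 20% of the rows to the test set and the remaining 80% to the training set. With the 8 rows above, scikit-learn rounds the test count up to 2, leaving 6 for training.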
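Both notes can be checked with a quick sketch. The arrays below are placeholder data with the same number of rows (8) as the sample DataFrame; the variable names are made up for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 8 rows, 2 feature columns
X = np.arange(16).reshape(8, 2)
y = np.array([0, 0, 1, 1, 1, 1, 0, 0])

# Two calls with the same random_state produce the same split
a_train, a_test, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
b_train, b_test, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
print(np.array_equal(a_test, b_test))  # True

# 20% of 8 rows is 1.6, which is rounded up to 2 test rows
print(len(a_train), len(a_test))  # 6 2
```

Leaving random_state unset gives a different shuffle on each run, which is fine for experimentation but makes results harder to compare.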
Real-World Applications
Finance
Predicting credit score or loan defaults
Healthcare
Patient disease prediction
Marketing
Customer churn or conversion prediction
Where Train-Test Split Is Applied
- All supervised learning pipelines
- Model validation and benchmarking
- Cross-validation techniques
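Cross-validation repeats the train-test idea over several folds, so every row is used for testing exactly once. Here is a minimal sketch using `cross_val_score`; the synthetic data, the model, and the choice of 5 folds are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data, used here only for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# 5-fold cross-validation: train on 4 folds, score on the remaining one, 5 times
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=5)

print("Fold accuracies:", scores)
print("Mean accuracy:  ", scores.mean())
```

Averaging over folds gives a steadier estimate than a single split, which helps most when the dataset is small.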
Resources
Data Science topic PDF
Harvard Data Science Course
Free online course from Harvard covering data science foundations
Interview Questions
➤ To evaluate the model’s performance on unseen data and avoid overfitting.
➤ Common ratios include 80/20, 70/30, or 75/25. It depends on the dataset size.
➤ It sets the seed for random shuffling, ensuring reproducibility.
➤ The model may perform well on training data but poorly on new data (overfitting).
➤ Yes, but it's better to use cross-validation or a separate validation set for tuning.
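One common way to get the separate validation set mentioned in the last answer is to call `train_test_split` twice. This is only a sketch: the placeholder arrays and the 60/20/20 ratio are assumptions, not a recommendation from the original text.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 rows, 2 feature columns
X = np.arange(200).reshape(100, 2)
y = np.array([0, 1] * 50)

# First split: hold out 20% as the final test set
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Second split: 25% of the remaining 80 rows becomes the validation set
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```

Tune on the validation set and touch the test set only once, for the final evaluation.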