Titanic Dataset Survival Analysis
Description
The Titanic dataset is a classic benchmark for binary classification. Each row describes one passenger, and the task is to predict survival from features such as:
- Age
- Gender (Sex)
- Passenger class (Pclass)
- Fare
- Port of embarkation (Embarked)
- Siblings/spouses aboard (SibSp)
- Parents/children aboard (Parch)
The goal is to predict whether a passenger survived (1) or not (0).
Prerequisites
- Python basics
- Pandas, NumPy, Matplotlib, Seaborn
- Scikit-learn
- Basic understanding of classification
Examples
The following Python example loads the dataset, cleans it, encodes the categorical columns, and trains a logistic regression classifier:
import warnings
warnings.filterwarnings("ignore")
# Import libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
# Load dataset
df = sns.load_dataset('titanic') # Or use: pd.read_csv('path/to/titanic.csv')
# Preview dataset
print(df.head())
# Drop unneeded columns
df = df.drop(['deck', 'embark_town', 'alive', 'who', 'adult_male', 'class'], axis=1)
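# Handle missing values (assign the result back; calling fillna with
# inplace=True on a selected column is deprecated in recent pandas versions)
df['age'] = df['age'].fillna(df['age'].median())
df['embarked'] = df['embarked'].fillna(df['embarked'].mode()[0])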
# Encode categorical features
le = LabelEncoder()
df['sex'] = le.fit_transform(df['sex']) # male:1, female:0
df['embarked'] = le.fit_transform(df['embarked']) # C:0, Q:1, S:2
# Drop rows with a missing 'fare' ('embarked' was already imputed above)
df.dropna(subset=['fare'], inplace=True)
# Define features and target
X = df[['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked']]
y = df['survived']
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Model training
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
# Evaluation
y_pred = model.predict(X_test)
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
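The numbers in the confusion matrix printed above can be turned into accuracy, precision, and recall by hand. The sketch below uses a small set of hypothetical labels (`y_true` and `y_hat` are made up for illustration and are not results from the Titanic model) and relies on scikit-learn's binary layout of `[[TN, FP], [FN, TP]]`.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels for ten passengers (illustration only, not real results).
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 0, 1]
y_hat = [0, 0, 0, 1, 1, 1, 0, 1, 0, 1]

# For binary labels the matrix is laid out as [[TN, FP], [FN, TP]],
# so ravel() yields the four counts in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # of predicted survivors, how many survived
recall = tp / (tp + fn)                      # of actual survivors, how many were found

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

The same three quantities appear in the output of `classification_report`, where precision and recall are listed per class.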
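🧠 Tip: scikit-learn's SelectKBest keeps the top k features ranked by a scoring function such as chi2 or f_classif. It is not used in the example above, and chi2 requires non-negative feature values.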
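As a rough illustration of how SelectKBest behaves, the sketch below fits it on a randomly generated frame whose column names mimic the encoded Titanic features. The data, the `toy` and `toy_target` names, and the choice of `k=2` are all assumptions made for this example; the target is deliberately built to depend on `sex` so the selector has a signal to find.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Hypothetical toy data that mimics encoded Titanic columns; not the real dataset.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "pclass": rng.integers(1, 4, size=n),   # values 1, 2, 3
    "sex": rng.integers(0, 2, size=n),      # 0 = female, 1 = male
    "age": rng.uniform(1, 70, size=n),
    "fare": rng.uniform(5, 100, size=n),
})

# Toy target: in about 90% of rows survival is the opposite of 'sex',
# so 'sex' is by construction the most informative feature.
toy_target = np.where(rng.random(n) < 0.9, 1 - toy["sex"], toy["sex"])

# Score every feature with the ANOVA F-test and keep the two highest-scoring ones.
selector = SelectKBest(score_func=f_classif, k=2)
selector.fit(toy, toy_target)

selected = list(toy.columns[selector.get_support()])
scores = dict(zip(toy.columns, selector.scores_))
print("Selected features:", selected)
print("F-scores:", scores)
```

On the real data, the same call would be made on `X_train` and `y_train` only, so that the test split does not influence which features are kept.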
Real-World Applications
- Education: teaching EDA and classification basics
- ML Training: benchmarking models on structured data
- Healthcare: risk prediction (such as patient survival)
Where the Titanic Dataset Is Applied
- Logistic Regression
- Feature engineering
- Binary classification
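Feature engineering on this dataset often means deriving new columns from existing ones. The sketch below is one possible example, not part of the walkthrough above: it combines `sibsp` and `parch` into a hypothetical `family_size` column and an `is_alone` flag, using a three-row toy frame invented for illustration.

```python
import pandas as pd

# Hypothetical three-passenger frame (illustration only).
toy = pd.DataFrame({"sibsp": [1, 0, 3], "parch": [0, 0, 2]})

# Family size counts siblings/spouses, parents/children, and the passenger.
toy["family_size"] = toy["sibsp"] + toy["parch"] + 1

# Flag passengers travelling without any family members.
toy["is_alone"] = (toy["family_size"] == 1).astype(int)

print(toy)
```

Derived columns like these can be appended to the feature list passed to the model; whether they help is something to check on held-out data.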
Resources
Data Science topic PDF
Harvard Data Science Course: a free online course from Harvard covering data science foundations
Interview Questions
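➤ The Titanic dataset contains passenger information used to predict survival.
➤ The target variable is survived (1 if the passenger survived, 0 if not).
➤ Missing values were handled by imputing age with the median and embarked with the mode.
➤ Suitable models include Logistic Regression, Decision Trees, Random Forest, and XGBoost, among others.
➤ Gender, passenger class, and age are typically the most important features.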
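The imputation described in the answers above (median for age, mode for embarked) can be checked on a tiny frame. This is a minimal sketch with made-up values; the `toy`, `age_median`, and `embarked_mode` names are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical four-passenger frame with one missing value per column.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 38.0, 26.0],
    "embarked": ["S", "C", None, "S"],
})

# median() and mode() both skip missing values by default.
age_median = toy["age"].median()
embarked_mode = toy["embarked"].mode()[0]

# Assign the filled columns back instead of using inplace=True.
toy["age"] = toy["age"].fillna(age_median)
toy["embarked"] = toy["embarked"].fillna(embarked_mode)

print(toy)
```

The median is preferred over the mean for age because it is less sensitive to a few very old or very young passengers.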