Titanic Dataset Survival Analysis

Description

The Titanic dataset is a standard benchmark for binary classification: given data about each passenger, predict whether that passenger survived. The passenger features include:
  • Age
  • Gender
  • Passenger class
  • Fare
  • Embarked location (port of embarkation)
  • Siblings/spouses aboard (SibSp)
  • Parents/children aboard (Parch)
The target is a single label: whether the passenger survived (1) or not (0).

Prerequisites

  • Python basics
  • Pandas, NumPy, Matplotlib, Seaborn
  • Scikit-learn
  • Basic understanding of classification

Examples

The following Python script loads the Titanic data through seaborn, cleans it, and trains a logistic regression baseline:


# Import libraries
import warnings

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# Silence library warnings to keep the output short (remove this while debugging)
warnings.filterwarnings("ignore")

# Load dataset
df = sns.load_dataset('titanic')  # Or use: pd.read_csv('path/to/titanic.csv')

# Preview dataset
print(df.head())

# Drop unneeded columns
df = df.drop(['deck', 'embark_town', 'alive', 'who', 'adult_male', 'class'], axis=1)

# Handle missing values
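# Assign the result back: calling fillna(..., inplace=True) on a selected column
# is deprecated in recent pandas and may not modify df at all
df['age'] = df['age'].fillna(df['age'].median())
df['embarked'] = df['embarked'].fillna(df['embarked'].mode()[0])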

# Encode categorical features
# LabelEncoder assigns integer codes in sorted order of the labels
le = LabelEncoder()
df['sex'] = le.fit_transform(df['sex'])  # female:0, male:1
df['embarked'] = le.fit_transform(df['embarked'])  # C:0, Q:1, S:2

# Drop rows with a missing 'fare' ('embarked' was already filled above)
df.dropna(subset=['fare'], inplace=True)

# Define features and target
X = df[['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked']]
y = df['survived']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Model training
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Evaluation
y_pred = model.predict(X_test)
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
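
The script above fills missing values before the train-test split and label-encodes 'embarked', which implies an ordering between ports. One possible alternative, sketched here under assumptions, wraps imputation and one-hot encoding in a scikit-learn Pipeline so that the median and mode are learned from the training rows only. The DataFrame named toy, its survival rule, and the missing-value rates are all made up for illustration; they are not real passenger records.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in with Titanic-like column names (NOT real passenger data).
rng = np.random.default_rng(42)
n = 300
toy = pd.DataFrame({
    "pclass": rng.integers(1, 4, size=n),
    "sex": rng.choice(["male", "female"], size=n),
    "age": rng.normal(30, 12, size=n).clip(1, 80),
    "fare": rng.exponential(30, size=n),
    "embarked": rng.choice(["S", "C", "Q"], size=n),
})
# Made-up rule so the model has some signal to find.
toy["survived"] = ((toy["sex"] == "female") | (toy["pclass"] == 1)).astype(int)
# Blank out some values to mimic the gaps in 'age' and 'embarked'.
toy.loc[toy.sample(frac=0.2, random_state=0).index, "age"] = np.nan
toy.loc[toy.sample(frac=0.05, random_state=1).index, "embarked"] = np.nan

num_cols = ["pclass", "age", "fare"]
cat_cols = ["sex", "embarked"]

preprocess = ColumnTransformer([
    # Numeric columns: median imputation, then standardization.
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), num_cols),
    # Categorical columns: mode imputation, then one-hot encoding.
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), cat_cols),
])

clf = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])

X = toy.drop(columns="survived")
y = toy["survived"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# fit() learns the imputation values and scaling from X_train only.
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(f"Held-out accuracy on the synthetic data: {accuracy:.2f}")
```

Because every preprocessing step lives inside the pipeline, the same object can be passed to cross-validation utilities without leaking test-fold statistics into training.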


🧠 Tip: scikit-learn's SelectKBest keeps the k highest-scoring features, ranked by a scoring function such as chi2 or f_classif.
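
As a rough illustration of that tip, the sketch below runs SelectKBest with f_classif on synthetic data from make_classification. The data, the choice of k=2, and the variable names are assumptions made for the example, not part of the Titanic workflow above. Note that chi2 requires non-negative feature values, so f_classif is used here because the synthetic features can be negative.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 6 features, only 2 of which carry class information.
X, y = make_classification(
    n_samples=200,
    n_features=6,
    n_informative=2,
    n_redundant=0,
    random_state=0,
)

# Score every feature with the ANOVA F-test and keep the best two.
selector = SelectKBest(score_func=f_classif, k=2)
X_top = selector.fit_transform(X, y)

kept = selector.get_support(indices=True)
print("F-scores per feature:", selector.scores_.round(1))
print("Indices of kept features:", kept)
print("Shape before:", X.shape, "after:", X_top.shape)
```

On the Titanic features, the same selector could be fitted on the encoded training columns; the scores only measure each feature on its own and ignore interactions between features.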

Real-World Applications

  • Education: teaching EDA and classification basics
  • ML training: benchmarking models on structured data
  • Healthcare: risk prediction (such as patient survival)

Where the Titanic Dataset Is Applied

  • Logistic Regression
  • Feature engineering
  • Binary classification
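
To make the feature engineering item concrete, here is a small sketch that derives new columns from SibSp, Parch, and fare. The rows are made up for illustration, and the derived names family_size, is_alone, and fare_per_person are hypothetical choices rather than columns defined by the original script.

```python
import pandas as pd

# Made-up rows using seaborn-style column names; not real passenger records.
toy = pd.DataFrame({
    "sibsp": [1, 0, 3, 0],
    "parch": [0, 0, 2, 1],
    "fare": [7.25, 71.28, 31.39, 8.05],
})

# Family size counts the passenger plus relatives aboard.
toy["family_size"] = toy["sibsp"] + toy["parch"] + 1

# Flag passengers travelling without any relatives.
toy["is_alone"] = (toy["family_size"] == 1).astype(int)

# Spread the fare over the family, assuming one shared fare per family.
toy["fare_per_person"] = toy["fare"] / toy["family_size"]

print(toy)
```

Whether such columns help is an empirical question; a reasonable check is to compare validation scores with and without them.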

Resources

Harvard Data Science Course

Free online course from Harvard covering data science foundations

Interview Questions

➤ The Titanic dataset contains passenger information and is used to predict survival.

➤ The target variable is survived (1 if the passenger survived, 0 if not).

➤ Missing values were handled by imputing age with the median and embarked with the mode.

➤ Suitable models include Logistic Regression, Decision Trees, Random Forest, and XGBoost.

➤ Gender, passenger class, and age are typically the most important features.
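
One way to compare the candidate models listed above is k-fold cross-validation. The sketch below is a minimal version under assumptions: it uses synthetic data from make_classification instead of the Titanic columns, default hyperparameters, and 5 folds. XGBoost is left out because it is a separate package from scikit-learn.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the encoded features.
X, y = make_classification(
    n_samples=300, n_features=7, n_informative=3, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Mean accuracy over 5 folds for each model.
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    results[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

The ranking on synthetic data says nothing about the ranking on the Titanic data; the point is only the comparison procedure.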