Feature Selection

Introduction

Description

Feature selection is the process of choosing the most relevant features (input variables) from a dataset, i.e., those that contribute most to predicting the target variable.
The goals are to:

  • Improve model performance (accuracy, training speed)
  • Reduce overfitting
  • Enhance generalization to unseen data
  • Make models easier to interpret

Prerequisites

  • Basic understanding of machine learning
  • Familiarity with pandas and NumPy
  • Train-test split and model evaluation techniques

Examples

Here's a simple example of univariate feature selection with scikit-learn's SelectKBest:


from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2
import pandas as pd

# Load sample dataset
data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Select top 2 features using the chi-squared test
# (chi2 requires non-negative feature values)
selector = SelectKBest(score_func=chi2, k=2)
X_new = selector.fit_transform(X, y)

# Print selected features
selected_columns = X.columns[selector.get_support()]
print("Selected Features:", selected_columns.tolist())

🧠 SelectKBest keeps the top k features ranked by a scoring function such as chi2 (which requires non-negative inputs) or f_classif (ANOVA F-test).
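To see how each feature is ranked rather than just which ones survive, you can fit the selector and inspect its scores_ attribute. The sketch below uses f_classif on the same iris data; the variable names are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
import pandas as pd

# Load the same iris dataset
data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# f_classif (ANOVA F-test) also accepts negative feature values,
# unlike chi2, which requires non-negative inputs
selector = SelectKBest(score_func=f_classif, k=2)
selector.fit(X, y)

# One score per feature; higher means more discriminative
scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)
print(scores)
```

The two petal features score far higher than the sepal features on iris, which is why SelectKBest keeps them.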

Real-World Applications

  • Finance: identifying the variables most relevant to credit scoring
  • Healthcare: selecting biomarkers for disease prediction
  • Marketing: identifying key demographics for ad targeting

Where Feature Selection Is Applied

  • Preprocessing phase of ML pipelines
  • Dimensionality reduction
  • Model tuning and performance improvement
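In practice, feature selection is wired into the preprocessing stage of an ML pipeline so the selector is fitted only on the training split, avoiding leakage into the test set. A minimal sketch with scikit-learn's Pipeline (the step names are arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Selector and classifier are fitted together on the training data;
# the test set never influences which features are chosen
pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=2)),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)
print("Test accuracy:", pipe.score(X_test, y_test))
```

Because the selector lives inside the pipeline, cross-validation and grid search (e.g., tuning k) automatically refit the selection on each fold.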

Resources

  • Feature Selection PDF (download)
  • Harvard Data Science Course: a free online course from Harvard covering data science foundations

Interview Questions

Q: What is feature selection?
➤ Selecting the most useful variables for a model, improving performance and reducing complexity.

Q: What are the main types of feature selection methods?
➤ Filter methods (correlation, chi2), wrapper methods (RFE), and embedded methods (Lasso, tree-based importance).
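As one illustration of an embedded method, Lasso's L1 penalty drives the coefficients of uninformative features to exactly zero, and scikit-learn's SelectFromModel keeps only the surviving ones. A sketch on a synthetic regression problem (the alpha value and data shape are arbitrary choices):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# Synthetic data: 10 features, only 3 of them informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

# L1 regularization zeroes out coefficients of irrelevant features
lasso = Lasso(alpha=1.0).fit(X, y)
selector = SelectFromModel(lasso, prefit=True)
X_selected = selector.transform(X)

print("Kept feature indices:", selector.get_support().nonzero()[0])
print("Shape after selection:", X_selected.shape)
```

Here selection falls out of model training itself, with no separate search over feature subsets, which is what distinguishes embedded methods from filters and wrappers.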

Q: How does feature selection differ from dimensionality reduction?
➤ Feature selection keeps a subset of the original features, while dimensionality reduction transforms them into new ones (e.g., PCA).

Q: How does feature selection reduce overfitting?
➤ By removing irrelevant or noisy features, it reduces the model's complexity and its chance of fitting noise.

Q: What is Recursive Feature Elimination (RFE)?
➤ A wrapper-style method that repeatedly fits a model and removes the least important features (judged by coefficients or feature importances) until the desired number remain.
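RFE can be sketched in a few lines with scikit-learn; here a logistic regression ranks features on iris, and the estimator and target count are illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# RFE repeatedly fits the estimator and drops the weakest feature
# (smallest coefficient magnitude here) until 2 features remain
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2)
rfe.fit(X, y)

print("Feature ranking (1 = selected):", rfe.ranking_)
print("Selected mask:", rfe.support_)
```

RFECV is a common variant that additionally uses cross-validated performance to choose how many features to keep.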