Feature engineering and preprocessing are among the most impactful steps in a machine learning project. The quality and representation of your features often matter more than the choice of algorithm. Better features lead to better models — even a simple algorithm with well-engineered features can outperform a complex algorithm with raw, unprocessed data.
| Aspect | Impact |
|---|---|
| Model performance | Good features capture the true signal in data, improving accuracy |
| Training speed | Fewer, better features mean faster training |
| Interpretability | Meaningful features make models easier to understand |
| Generalisation | Proper preprocessing prevents data leakage and overfitting |
Real-world datasets almost always contain missing values, and most estimators cannot handle them directly. Common strategies:
| Strategy | When to Use | Scikit-Learn Class |
|---|---|---|
| Drop rows | Few missing values, large dataset | DataFrame.dropna() |
| Mean/Median imputation | Numerical features, missing at random | SimpleImputer(strategy='mean') |
| Mode imputation | Categorical features | SimpleImputer(strategy='most_frequent') |
| KNN imputation | Values related to nearby data points | KNNImputer(n_neighbors=5) |
| Iterative imputation | Complex relationships between features | IterativeImputer() |
```python
from sklearn.impute import SimpleImputer, KNNImputer

# Mean imputation
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X_train)

# KNN imputation
knn_imputer = KNNImputer(n_neighbors=5)
X_knn_imputed = knn_imputer.fit_transform(X_train)
```
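The table also lists IterativeImputer, which scikit-learn still marks as experimental, so it needs an explicit enabling import. A minimal sketch on a toy array (the data here is purely illustrative):

```python
import numpy as np
# IterativeImputer is experimental; this import enables it
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy data: np.nan marks the missing entries
X = np.array([[1.0, 2.0], [3.0, 6.0], [4.0, np.nan], [np.nan, 4.0]])

# Each feature is modelled as a function of the others, iterating until stable
it_imputer = IterativeImputer(random_state=0)
X_filled = it_imputer.fit_transform(X)
print(X_filled.shape)  # same shape as X, with no NaNs remaining
```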
Many algorithms (SVM, KNN, neural networks, gradient descent-based methods) are sensitive to the scale of features. A feature ranging from 0 to 1000 will dominate a feature ranging from 0 to 1.
| Method | Formula | Range | When to Use |
|---|---|---|---|
| StandardScaler | (x - mean) / std | Unbounded (mean=0, std=1) | General purpose, most algorithms |
| MinMaxScaler | (x - min) / (max - min) | [0, 1] | When you need bounded values (e.g., neural networks) |
| RobustScaler | (x - median) / IQR | Unbounded | Data with outliers |
| MaxAbsScaler | x / max(abs(x)) | [-1, 1] | Sparse data |
```python
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Standard scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Use the same fitted scaler!

# MinMax scaling
minmax = MinMaxScaler()
X_train_minmax = minmax.fit_transform(X_train)
```
Tip: Always fit the scaler on the training data only, then call `transform()` on the test data. Fitting on the test data causes data leakage.
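A convenient way to follow this tip is to wrap the scaler and model in a scikit-learn Pipeline, so cross-validation re-fits the scaler inside each training fold automatically. A sketch under illustrative assumptions (the synthetic data and choice of LogisticRegression are not from the lesson):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * [1, 100, 1000]  # wildly different scales
y = (X[:, 0] + X[:, 1] / 100 > 0).astype(int)   # linear rule in the raw features

# The scaler is fitted on each CV training fold only, so there is no leakage
pipe = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```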
Machine learning algorithms require numerical input. Categorical variables must be encoded.
| Method | Description | When to Use |
|---|---|---|
| Label Encoding | Assigns an integer to each category (0, 1, 2, ...) | Target labels (for input features, prefer Ordinal or One-Hot Encoding) |
| One-Hot Encoding | Creates a binary column for each category | Nominal categories (e.g., colour, country) |
| Ordinal Encoding | Maps categories to ordered integers | When there is a natural order |
| Target Encoding | Replaces category with mean of the target | High-cardinality features |
```python
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# One-Hot Encoding (drop='first' removes one redundant column per feature)
ohe = OneHotEncoder(sparse_output=False, drop='first')
X_encoded = ohe.fit_transform(X_categorical)

# Ordinal Encoding (for ordered categories)
oe = OrdinalEncoder(categories=[['low', 'medium', 'high']])
X_ordinal = oe.fit_transform(X_ordinal_categorical)
```
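The table above also mentions target encoding. Its core idea can be sketched with plain pandas: each category is replaced by the mean of the target over the training rows (the toy DataFrame and column names here are illustrative):

```python
import pandas as pd

# Toy training data; 'city' stands in for a high-cardinality categorical feature
train = pd.DataFrame({
    'city': ['a', 'a', 'b', 'b', 'c'],
    'target': [1, 0, 1, 1, 0],
})

# Per-category mean of the target, computed on training data only
means = train.groupby('city')['target'].mean()
train['city_encoded'] = train['city'].map(means)
print(means.to_dict())  # {'a': 0.5, 'b': 1.0, 'c': 0.0}
```

In practice, target encoding is prone to leakage, so the category means must come from the training split (or from cross-fitting), never from the full dataset.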
Creating new features from existing data can significantly improve model performance.