Feature engineering and preprocessing are among the most impactful steps in a machine learning project. The quality and representation of your features often matter more than the choice of algorithm. Better features lead to better models — even a simple algorithm with well-engineered features can outperform a complex algorithm with raw, unprocessed data.
| Aspect | Impact |
|---|---|
| Model performance | Good features capture the true signal in data, improving accuracy |
| Training speed | Fewer, better features mean faster training |
| Interpretability | Meaningful features make models easier to understand |
| Generalisation | Proper preprocessing prevents data leakage and overfitting |
Real-world datasets almost always contain missing values, and most estimators cannot handle them directly. Common strategies:
| Strategy | When to Use | Scikit-Learn Class |
|---|---|---|
| Drop rows | Few missing values, large dataset | DataFrame.dropna() |
| Mean/Median imputation | Numerical features, missing at random | SimpleImputer(strategy='mean') |
| Mode imputation | Categorical features | SimpleImputer(strategy='most_frequent') |
| KNN imputation | Values related to nearby data points | KNNImputer(n_neighbors=5) |
| Iterative imputation | Complex relationships between features | IterativeImputer() |
```python
from sklearn.impute import SimpleImputer, KNNImputer

# Mean imputation
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X_train)

# KNN imputation
knn_imputer = KNNImputer(n_neighbors=5)
X_knn_imputed = knn_imputer.fit_transform(X_train)
```
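The table also lists IterativeImputer, which scikit-learn still marks as experimental, so it needs an explicit enabling import. A minimal sketch on a toy array (the data here is purely illustrative):

```python
import numpy as np
# IterativeImputer is experimental; this import enables it
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy data: np.nan marks the missing entries
X = np.array([[1.0, 2.0], [3.0, 6.0], [4.0, np.nan], [np.nan, 4.0]])

# Each feature is modelled as a function of the others, iterating until stable
it_imputer = IterativeImputer(random_state=0)
X_filled = it_imputer.fit_transform(X)
print(X_filled.shape)  # same shape as X, with no NaNs remaining
```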
Many algorithms (SVM, KNN, neural networks, gradient descent-based methods) are sensitive to the scale of features. A feature ranging from 0 to 1000 will dominate a feature ranging from 0 to 1.
| Method | Formula | Range | When to Use |
|---|---|---|---|
| StandardScaler | (x - mean) / std | Unbounded (mean=0, std=1) | General purpose, most algorithms |
| MinMaxScaler | (x - min) / (max - min) | [0, 1] | When you need bounded values (e.g., neural networks) |
| RobustScaler | (x - median) / IQR | Unbounded | Data with outliers |
| MaxAbsScaler | x / max(abs(x)) | [-1, 1] | Sparse data |
```python
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Standard scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Use the same fitted scaler!

# MinMax scaling
minmax = MinMaxScaler()
X_train_minmax = minmax.fit_transform(X_train)
```
Tip: Always fit the scaler on the training data only, then call `transform()` on the test data. Fitting on the test data causes data leakage.
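A convenient way to follow this tip is to wrap the scaler and model in a scikit-learn Pipeline, so cross-validation re-fits the scaler inside each training fold automatically. A sketch under illustrative assumptions (the synthetic data and choice of LogisticRegression are not from the lesson):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * [1, 100, 1000]  # wildly different scales
y = (X[:, 0] + X[:, 1] / 100 > 0).astype(int)   # linear rule in the raw features

# The scaler is fitted on each CV training fold only, so there is no leakage
pipe = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```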
Machine learning algorithms require numerical input. Categorical variables must be encoded.
| Method | Description | When to Use |
|---|---|---|
| Label Encoding | Assigns an integer to each category (0, 1, 2, ...) | Target labels (for input features, prefer Ordinal or One-Hot Encoding) |
| One-Hot Encoding | Creates a binary column for each category | Nominal categories (e.g., colour, country) |
| Ordinal Encoding | Maps categories to ordered integers | When there is a natural order |
| Target Encoding | Replaces category with mean of the target | High-cardinality features |
```python
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# One-Hot Encoding (drop='first' removes one redundant column per feature)
ohe = OneHotEncoder(sparse_output=False, drop='first')
X_encoded = ohe.fit_transform(X_categorical)

# Ordinal Encoding (for ordered categories)
oe = OrdinalEncoder(categories=[['low', 'medium', 'high']])
X_ordinal = oe.fit_transform(X_ordinal_categorical)
```
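The table above also mentions target encoding. Its core idea can be sketched with plain pandas: each category is replaced by the mean of the target over the training rows (the toy DataFrame and column names here are illustrative):

```python
import pandas as pd

# Toy training data; 'city' stands in for a high-cardinality categorical feature
train = pd.DataFrame({
    'city': ['a', 'a', 'b', 'b', 'c'],
    'target': [1, 0, 1, 1, 0],
})

# Per-category mean of the target, computed on training data only
means = train.groupby('city')['target'].mean()
train['city_encoded'] = train['city'].map(means)
print(means.to_dict())  # {'a': 0.5, 'b': 1.0, 'c': 0.0}
```

In practice, target encoding is prone to leakage, so the category means must come from the training split (or from cross-fitting), never from the full dataset.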
Creating new features from existing data can significantly improve model performance.