Decision trees are one of the most intuitive and widely used machine learning algorithms. They model decisions as a tree-like structure of rules, making them easy to interpret and visualise. Random forests extend decision trees into a powerful ensemble method that reduces overfitting and improves accuracy.
A decision tree is a flowchart-like structure where:

- each internal node tests a feature against a threshold (e.g. "petal length ≤ 2.45"),
- each branch represents an outcome of that test, and
- each leaf node holds a prediction: a class label for classification or a numeric value for regression.
The algorithm recursively splits the data by choosing the feature and threshold that best separates the classes (or reduces prediction error). At each node, it asks: "Which feature split produces the purest subgroups?"
The most common splitting criteria are:

| Criterion | Used For | Description |
|---|---|---|
| Gini Impurity | Classification | The probability of misclassifying a randomly chosen element if it were labelled according to the node's class distribution |
| Entropy (Information Gain) | Classification | The disorder in a node; splits are chosen to maximise the reduction in entropy (the information gain) |
| MSE (Mean Squared Error) | Regression | The variance of target values in a node; splits are chosen to minimise it |
Gini = 1 - Σ p_i²

where p_i is the proportion of samples in the node that belong to class i. A Gini of 0 means the node is pure (all samples belong to one class); higher values indicate a more even mix of classes.
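To make this concrete, here is a minimal sketch of computing Gini impurity from a node's class labels (the helper name `gini_impurity` and the example labels are ours, purely for illustration):

```python
import numpy as np

def gini_impurity(labels):
    """Gini = 1 - sum(p_i^2), where p_i is the proportion of class i."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_impurity([0, 0, 0, 0]))  # 0.0   -- pure node
print(gini_impurity([0, 0, 1, 1]))  # 0.5   -- maximally mixed two-class node
print(gini_impurity([0, 0, 0, 1]))  # 0.375
```

During training, scikit-learn evaluates this impurity (weighted by subgroup size) for every candidate split and keeps the best one. The full example below trains and visualises a tree on the Iris dataset.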
```python
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

# Load data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# Train
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)

# Visualise
plt.figure(figsize=(16, 8))
plot_tree(tree, feature_names=iris.feature_names,
          class_names=iris.target_names, filled=True, rounded=True)
plt.title('Decision Tree — Iris Dataset')
plt.show()

print(f"Accuracy: {tree.score(X_test, y_test):.2f}")
```
| Advantages | Disadvantages |
|---|---|
| Easy to understand and interpret | Prone to overfitting without pruning |
| No feature scaling required | Sensitive to small changes in data |
| Handles both numerical and categorical data | Can create biased trees with imbalanced data |
| Can capture non-linear relationships | Greedy algorithm — not globally optimal |
Several `DecisionTreeClassifier` hyperparameters help control overfitting:

| Technique | Parameter | Description |
|---|---|---|
| Max depth | `max_depth` | Limits how deep the tree can grow |
| Min samples split | `min_samples_split` | Minimum samples required to split a node |
| Min samples leaf | `min_samples_leaf` | Minimum samples required in a leaf node |
| Max features | `max_features` | Number of features considered for each split |
| Pruning | `ccp_alpha` | Cost-complexity pruning — removes branches that provide little benefit |
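As a sketch of how cost-complexity pruning can be applied (continuing from the Iris example above; the choice of alpha here is illustrative, and in practice it would be tuned by cross-validation):

```python
from sklearn.tree import DecisionTreeClassifier

# Compute the sequence of effective alphas for this training set.
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(
    X_train, y_train
)

# Larger alphas prune more aggressively; pick a middle value for illustration.
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]
pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
pruned.fit(X_train, y_train)
print(f"Pruned-tree accuracy: {pruned.score(X_test, y_test):.2f}")
```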
A random forest is an ensemble of many decision trees. Each tree is trained on a different random subset of the data and features, and the final prediction is determined by voting (classification) or averaging (regression).
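As a minimal sketch (the hyperparameters are illustrative defaults, not tuned), a random forest can be trained on the same Iris split with scikit-learn:

```python
from sklearn.ensemble import RandomForestClassifier

# 100 trees, each fit on a bootstrap sample of the rows; every split
# considers a random subset of the features.
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)
print(f"Random forest accuracy: {forest.score(X_test, y_test):.2f}")
```

Because each tree sees a different sample of rows and features, the trees make different mistakes, and those individual errors tend to cancel out when their votes are combined.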