Scikit-Learn (also written sklearn) is the most widely used machine learning library in Python. It provides simple, efficient tools for data mining and data analysis, built on NumPy, SciPy, and Matplotlib. Scikit-Learn implements a consistent API across all its algorithms, making it easy to experiment with different models.
Machine learning is a subset of artificial intelligence where algorithms learn patterns from data to make predictions or decisions.
Machine learning tasks fall into three broad types:

| Type | Description | Example |
|---|---|---|
| Supervised Learning | Learn from labelled data (input-output pairs) | Predicting house prices, classifying emails as spam |
| Unsupervised Learning | Find patterns in unlabelled data | Customer segmentation, anomaly detection |
| Reinforcement Learning | Learn by interacting with an environment | Game playing, robotics (not covered by Scikit-Learn) |
Supervised learning itself splits into two kinds of task, distinguished by the type of output:

| Task | Output | Example |
|---|---|---|
| Classification | Discrete category | Spam/not spam, dog/cat/bird |
| Regression | Continuous value | House price, temperature, revenue |
Every Scikit-Learn model follows the same pattern:
```python
from sklearn.some_module import SomeModel

# 1. Create the model
model = SomeModel(hyperparameter=value)

# 2. Fit (train) the model
model.fit(X_train, y_train)

# 3. Predict
predictions = model.predict(X_test)

# 4. Evaluate
score = model.score(X_test, y_test)
```
This consistent API means you can swap models with a single line change.
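For example, here is a small sketch (using the bundled iris data) in which only the line that constructs the model differs between a k-nearest-neighbours classifier and a decision tree; `fit`, `predict`, and `score` are identical:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Only the constructor call changes between the two models;
# training and evaluation code stays exactly the same
scores = {}
for model in (KNeighborsClassifier(n_neighbors=3),
              DecisionTreeClassifier(random_state=42)):
    model.fit(X_train, y_train)
    scores[type(model).__name__] = model.score(X_test, y_test)

print(scores)
```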
The most fundamental rule: never evaluate a model on the data it was trained on. A model can memorise its training data, so performance on already-seen data overstates how well it will generalise to new data.
```python
from sklearn.model_selection import train_test_split

# Split data: 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,  # for reproducibility
)

print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")
```
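For classification data with imbalanced classes, `train_test_split` also accepts a `stratify` argument that preserves the class proportions in both splits. A minimal sketch with a toy 90/10 label array:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 90 samples of class 0, 10 of class 1 (imbalanced)
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Each split keeps the 90/10 ratio
print(np.bincount(y_train))  # → [72  8]
print(np.bincount(y_test))   # → [18  2]
```

Without `stratify`, a small test set can end up with very few (or zero) samples of the minority class by chance.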
A complete classification example, predicting iris species from flower measurements:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train a K-Nearest Neighbours classifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# Predict
y_pred = knn.predict(X_test)

# Evaluate
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
```
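Because every estimator shares the same interface, trying different hyperparameter values is just a loop. A sketch that evaluates several values of `n_neighbors` on the same held-out split:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Held-out accuracy for a few values of k
accuracies = {}
for k in (1, 3, 5, 7, 9):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    accuracies[k] = knn.score(X_test, y_test)

print(accuracies)
```

In practice you would tune hyperparameters with cross-validation (e.g. `GridSearchCV`) rather than against the test set, which should only be used once for the final evaluation.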
A complete regression example, predicting California house values (the dataset is downloaded on first use):

```python
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Load data
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train a linear regression model
lr = LinearRegression()
lr.fit(X_train, y_train)

# Predict
y_pred = lr.predict(X_test)

# Evaluate
print(f"R² Score: {r2_score(y_test, y_pred):.3f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.3f}")
```
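A single train/test split can be optimistic or pessimistic by chance; k-fold cross-validation averages the score over several splits for a more stable estimate. A sketch using the smaller bundled diabetes dataset (chosen here because, unlike the housing data, it ships with scikit-learn and needs no download):

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

# 5-fold cross-validation: each fold is held out once for scoring
scores = cross_val_score(LinearRegression(), X, y, cv=5)

print(f"R² per fold: {scores.round(3)}")
print(f"Mean R²: {scores.mean():.3f} ± {scores.std():.3f}")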
Common algorithms and when to reach for them:

| Algorithm | When to use |
|---|---|
| Logistic Regression | Binary classification, interpretable model |
| K-Nearest Neighbours | Simple, no training phase, good baseline |
| Decision Tree | Interpretable, handles mixed data types |
| Random Forest | Robust, handles overfitting, good default choice |
| Support Vector Machine | High-dimensional data, clear margin of separation |
| Gradient Boosting | Best accuracy for tabular data (XGBoost, LightGBM) |
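The table above maps directly onto code: because every estimator shares `fit` and `score`, a quick baseline comparison of several algorithms is only a few lines. A sketch on the iris data using default settings (no tuning, so treat the numbers as rough baselines only):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit each model and record its held-out accuracy
results = {}
for model in (
    LogisticRegression(max_iter=1000),
    DecisionTreeClassifier(random_state=42),
    RandomForestClassifier(random_state=42),
):
    model.fit(X_train, y_train)
    results[type(model).__name__] = model.score(X_test, y_test)

for name, acc in results.items():
    print(f"{name}: {acc:.2f}")
```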