Scikit-Learn (also written sklearn) is the most widely used machine learning library in Python. It provides simple, efficient tools for data mining and data analysis, built on NumPy, SciPy, and Matplotlib. Scikit-Learn implements a consistent API across all its algorithms, making it easy to experiment with different models.
Machine learning is a subset of artificial intelligence where algorithms learn patterns from data to make predictions or decisions.
Machine learning tasks fall into three broad types:

| Type | Description | Example |
|---|---|---|
| Supervised Learning | Learn from labelled data (input-output pairs) | Predicting house prices, classifying emails as spam |
| Unsupervised Learning | Find patterns in unlabelled data | Customer segmentation, anomaly detection |
| Reinforcement Learning | Learn by interacting with an environment | Game playing, robotics (not covered by Scikit-Learn) |
Supervised learning itself splits into two kinds of task, distinguished by the type of output:

| Task | Output | Example |
|---|---|---|
| Classification | Discrete category | Spam/not spam, dog/cat/bird |
| Regression | Continuous value | House price, temperature, revenue |
Every Scikit-Learn model follows the same pattern:
```python
from sklearn.some_module import SomeModel

# 1. Create the model
model = SomeModel(hyperparameter=value)

# 2. Fit (train) the model
model.fit(X_train, y_train)

# 3. Predict
predictions = model.predict(X_test)

# 4. Evaluate
score = model.score(X_test, y_test)
```
This consistent API means you can swap models with a single line change.
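For example, here is a small sketch (using the bundled iris data) in which only the line that constructs the model differs between a k-nearest-neighbours classifier and a decision tree; `fit`, `predict`, and `score` are identical:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Only the constructor call changes between the two models;
# training and evaluation code stays exactly the same
scores = {}
for model in (KNeighborsClassifier(n_neighbors=3),
              DecisionTreeClassifier(random_state=42)):
    model.fit(X_train, y_train)
    scores[type(model).__name__] = model.score(X_test, y_test)

print(scores)
```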
The most fundamental rule: never evaluate a model on the data it was trained on. A model can memorise its training data, so performance on already-seen data overstates how well it will generalise to new data.
```python
from sklearn.model_selection import train_test_split

# Split data: 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,  # for reproducibility
)

print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")
```
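For classification data with imbalanced classes, `train_test_split` also accepts a `stratify` argument that preserves the class proportions in both splits. A minimal sketch with a toy 90/10 label array:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 90 samples of class 0, 10 of class 1 (imbalanced)
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Each split keeps the 90/10 ratio
print(np.bincount(y_train))  # → [72  8]
print(np.bincount(y_test))   # → [18  2]
```

Without `stratify`, a small test set can end up with very few (or zero) samples of the minority class by chance.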
A complete classification example, predicting iris species from flower measurements:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train a K-Nearest Neighbours classifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# Predict
y_pred = knn.predict(X_test)

# Evaluate
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
```
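Because every estimator shares the same interface, trying different hyperparameter values is just a loop. A sketch that evaluates several values of `n_neighbors` on the same held-out split:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Held-out accuracy for a few values of k
accuracies = {}
for k in (1, 3, 5, 7, 9):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    accuracies[k] = knn.score(X_test, y_test)

print(accuracies)
```

In practice you would tune hyperparameters with cross-validation (e.g. `GridSearchCV`) rather than against the test set, which should only be used once for the final evaluation.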
A complete regression example, predicting California house values (the dataset is downloaded on first use):

```python
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Load data
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train a linear regression model
lr = LinearRegression()
lr.fit(X_train, y_train)

# Predict
y_pred = lr.predict(X_test)

# Evaluate
print(f"R² Score: {r2_score(y_test, y_pred):.3f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.3f}")
```
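A single train/test split can be optimistic or pessimistic by chance; k-fold cross-validation averages the score over several splits for a more stable estimate. A sketch using the smaller bundled diabetes dataset (chosen here because, unlike the housing data, it ships with scikit-learn and needs no download):

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

# 5-fold cross-validation: each fold is held out once for scoring
scores = cross_val_score(LinearRegression(), X, y, cv=5)

print(f"R² per fold: {scores.round(3)}")
print(f"Mean R²: {scores.mean():.3f} ± {scores.std():.3f}")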
Common algorithms and when to reach for them:

| Algorithm | When to use |
|---|---|
| Logistic Regression | Binary classification, interpretable model |
| K-Nearest Neighbours | Simple, no training phase, good baseline |
| Decision Tree | Interpretable, handles mixed data types |
| Random Forest | Robust, handles overfitting, good default choice |
| Support Vector Machine | High-dimensional data, clear margin of separation |
| Gradient Boosting | Best accuracy for tabular data (XGBoost, LightGBM) |
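The table above maps directly onto code: because every estimator shares `fit` and `score`, a quick baseline comparison of several algorithms is only a few lines. A sketch on the iris data using default settings (no tuning, so treat the numbers as rough baselines only):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit each model and record its held-out accuracy
results = {}
for model in (
    LogisticRegression(max_iter=1000),
    DecisionTreeClassifier(random_state=42),
    RandomForestClassifier(random_state=42),
):
    model.fit(X_train, y_train)
    results[type(model).__name__] = model.score(X_test, y_test)

for name, acc in results.items():
    print(f"{name}: {acc:.2f}")
```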