Building a successful machine learning system requires more than just training a model. It involves a structured workflow from problem definition to deployment, along with best practices that prevent common mistakes and ensure reliable, reproducible results. This lesson brings everything together into a complete ML workflow.
| Step | Stage | Key Activities |
|---|---|---|
| 1 | Define the Problem | Understand the business question, define success metrics |
| 2 | Collect Data | Gather relevant data from databases, APIs, files |
| 3 | Explore Data (EDA) | Visualise distributions, correlations, missing values |
| 4 | Preprocess Data | Clean, impute, scale, encode features |
| 5 | Engineer Features | Create new features, select important ones |
| 6 | Train Models | Try multiple algorithms, use cross-validation |
| 7 | Evaluate Models | Compare metrics, analyse errors |
| 8 | Tune Hyperparameters | Grid search, randomised search |
| 9 | Deploy Model | Serve predictions in production |
| 10 | Monitor and Maintain | Track performance, retrain when needed |
Before writing any code, answer these questions:

- What business question are we answering, and who will act on the predictions?
- How will success be measured, and which metric maps to that outcome?
- What data is available, and is it labelled?
- What baseline (current process or simple heuristic) must the model beat?
Tip: A model that improves a business metric by even 1% can be extremely valuable. Always define a measurable success criterion before starting.
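One concrete way to make that criterion measurable is to start from a trivial baseline. A minimal sketch using scikit-learn's `DummyClassifier` on a hypothetical imbalanced target (the data here is made up): any real model must beat this number to justify deployment.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical imbalanced target: 90% negatives, 10% positives
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 1))  # features are irrelevant to this baseline

baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X, y)
print(f"Majority-class accuracy: {baseline.score(X, y):.2f}")  # 0.90
```

On a 90/10 split, always predicting the majority class already scores 90% accuracy, which is why accuracy alone is a poor success criterion for imbalanced problems.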
```python
import pandas as pd

# Load data
df = pd.read_csv('data.csv')

# Basic information
print(df.shape)
df.info()
print(df.describe())

# Missing values
print(df.isnull().sum())

# Target distribution
print(df['target'].value_counts(normalize=True))

# Correlations with the target (numeric columns only)
print(df.corr(numeric_only=True)['target'].sort_values(ascending=False))
```
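Beyond summary statistics, comparing feature averages across target classes often reveals which features separate the classes. A small sketch on a hypothetical stand-in frame (column names and values are illustrative):

```python
import pandas as pd

# Hypothetical stand-in for df
df = pd.DataFrame({
    'age':    [25, 32, 47, 51, 38, 60],
    'income': [30_000, 42_000, 81_000, 90_000, 55_000, 88_000],
    'target': [0, 0, 1, 1, 0, 1],
})

# Mean of each numeric feature, split by target class
print(df.groupby('target').mean())
```

A large gap between the per-class means is a quick hint that a feature carries signal, which you can then confirm with proper visualisation.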
A complete ML pipeline automates the entire workflow from raw data to predictions:
```python
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Define column types
numeric_features = ['age', 'income', 'credit_score']
categorical_features = ['employment_type', 'education']

# Preprocessing
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    # handle_unknown='ignore' keeps unseen categories from crashing at predict time
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

# Full pipeline
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=200, random_state=42))
])

# Cross-validation (X_train, y_train come from an earlier train/test split)
scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='f1')
print(f"CV F1: {scores.mean():.4f} (+/- {scores.std():.4f})")
```
Always try multiple algorithms and compare their performance:
```python
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import (
    RandomForestClassifier, GradientBoostingClassifier
)
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(n_estimators=200, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=200, random_state=42),
    'SVM': SVC(kernel='rbf', random_state=42),
}

for name, model in models.items():
    pipe = Pipeline([
        ('preprocessor', preprocessor),
        ('model', model)
    ])
    scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring='f1')
    print(f"{name:25s} F1: {scores.mean():.4f} (+/- {scores.std():.4f})")
```
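The same loop can collect scores into a dictionary so the winner is selected programmatically rather than by eye. A self-contained sketch on synthetic data (`make_classification`, with two small, fast models standing in for the full candidate set):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

candidates = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
}

# Mean cross-validated F1 for each candidate
results = {name: cross_val_score(model, X, y, cv=3, scoring='f1').mean()
           for name, model in candidates.items()}

best_name = max(results, key=results.get)
print(f"Best model: {best_name} (F1 = {results[best_name]:.4f})")
```

Storing results this way also makes it easy to log every run, which pays off when you revisit the project months later.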
```python
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_distributions = {
    'classifier__n_estimators': randint(100, 500),
    'classifier__max_depth': randint(3, 30),
    'classifier__min_samples_split': randint(2, 20),
    'classifier__min_samples_leaf': randint(1, 10),
}

search = RandomizedSearchCV(
    pipeline, param_distributions, n_iter=50,
    cv=5, scoring='f1', random_state=42, n_jobs=-1
)
search.fit(X_train, y_train)

print(f"Best F1: {search.best_score_:.4f}")
print(f"Best params: {search.best_params_}")
```
After tuning, evaluate the best model on the held-out test set (used only once):
```python
from sklearn.metrics import classification_report, roc_auc_score

best_model = search.best_estimator_
y_pred = best_model.predict(X_test)
print(classification_report(y_test, y_pred))

# If the model supports probabilities
if hasattr(best_model, 'predict_proba'):
    y_proba = best_model.predict_proba(X_test)[:, 1]
    print(f"ROC AUC: {roc_auc_score(y_test, y_proba):.4f}")
```
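A confusion matrix complements the report above by showing *how* the model goes wrong: false positives versus false negatives often have very different business costs. A sketch with hypothetical label arrays:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_hat  = [0, 1, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")  # TN=3 FP=1 FN=1 TP=3
```

Looking at the individual misclassified rows (their feature values, not just the counts) is the fastest route to ideas for new features or data cleaning.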
```python
import joblib

# Save the entire pipeline (preprocessing + model)
joblib.dump(best_model, 'model_pipeline.pkl')

# Load it later
loaded_model = joblib.load('model_pipeline.pkl')

# new_data must be a DataFrame with the same raw columns as the training data
predictions = loaded_model.predict(new_data)
```