This final lesson ties together everything you have learned and provides a framework for approaching any data science project. We will cover the end-to-end data science workflow, best practices for project organisation, and the many directions you can take your learning from here.
Every data science project, regardless of domain or complexity, follows a similar lifecycle:
"What question are we trying to answer?"
This is the most important step. A poorly defined problem leads to wasted effort.
| Good question | Poor question |
|---|---|
| "Can we predict which customers will churn in the next 30 days?" | "Tell me something interesting about our customers" |
| "What factors most influence house prices in this region?" | "Analyse the housing data" |
| "Can we classify support tickets by urgency automatically?" | "Do some machine learning on our tickets" |
Key considerations:

- Who will use the answer, and what decision will it inform?
- Is the question measurable, with a clear definition of success?
- Is the data needed to answer it available, or obtainable?
Gather data from all relevant sources:
```python
import pandas as pd
from sqlalchemy import create_engine

# Combine multiple data sources
engine = create_engine('postgresql://user:password@host/db')  # connection string is illustrative
customers = pd.read_csv('customers.csv')
orders = pd.read_sql('SELECT * FROM orders', engine)
web_logs = pd.read_json('web_activity.json')

# Merge into a single dataset
df = customers.merge(orders, on='customer_id', how='left')
df = df.merge(web_logs, on='customer_id', how='left')
```
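After each merge, it is worth verifying that the join behaved as expected. A minimal sketch using toy data (the values here are hypothetical stand-ins for the real sources): passing `indicator=True` adds a `_merge` column that reveals customers with no matching orders.

```python
import pandas as pd

# Toy stand-ins for the real data sources
customers = pd.DataFrame({'customer_id': [1, 2, 3], 'plan': ['basic', 'pro', 'basic']})
orders = pd.DataFrame({'customer_id': [1, 1, 3], 'amount': [10.0, 25.0, 5.0]})

# indicator=True adds a _merge column showing where each row came from
merged = customers.merge(orders, on='customer_id', how='left', indicator=True)

# Customers with no matching orders appear as 'left_only'
unmatched = merged[merged['_merge'] == 'left_only']
print(len(merged), len(unmatched))  # → 4 1
```

Checking row counts before and after a merge catches both silent row explosion (duplicate keys) and silent data loss early.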
Apply the techniques from Lesson 5:
```python
# Missing values
df = handle_missing_values(df)

# Duplicates
df = df.drop_duplicates()

# Data types
df = fix_data_types(df)

# Outliers
df = handle_outliers(df)

# Feature engineering
df = create_features(df)
```
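The helpers above come from Lesson 5. As a reminder of what they encapsulate, here is a minimal sketch of what a `handle_missing_values` implementation might look like (the imputation strategy is a simple illustrative choice, not the only one):

```python
import numpy as np
import pandas as pd

def handle_missing_values(df, num_strategy='median'):
    """Fill numeric columns with the median (or mean) and
    categorical columns with the mode. A simplified sketch."""
    df = df.copy()
    for col in df.columns:
        if df[col].isna().any():
            if pd.api.types.is_numeric_dtype(df[col]):
                fill = df[col].median() if num_strategy == 'median' else df[col].mean()
            else:
                fill = df[col].mode().iloc[0]
            df[col] = df[col].fillna(fill)
    return df

# Tiny example with one missing value per column
df = pd.DataFrame({'age': [25, np.nan, 40], 'city': ['a', None, 'a']})
clean = handle_missing_values(df)
print(clean.isna().sum().sum())  # → 0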
Apply the techniques from Lesson 7:
```python
# Understand distributions
plot_distributions(df)

# Examine correlations
plot_correlation_matrix(df)

# Investigate relationships
plot_bivariate_analysis(df, target='churn')
```
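As with the cleaning helpers, these plotting functions come from Lesson 7. A minimal sketch of what `plot_correlation_matrix` might look like under the hood, using only pandas and matplotlib (the colour map and layout details are illustrative):

```python
import matplotlib
matplotlib.use('Agg')  # headless-safe backend
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

def plot_correlation_matrix(df):
    """Draw a heatmap of pairwise correlations between numeric columns."""
    corr = df.select_dtypes('number').corr()
    fig, ax = plt.subplots()
    im = ax.imshow(corr, vmin=-1, vmax=1, cmap='coolwarm')
    ax.set_xticks(range(len(corr)), corr.columns, rotation=45)
    ax.set_yticks(range(len(corr)), corr.columns)
    fig.colorbar(im, label='Pearson correlation')
    return corr

# Example on random data
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=['a', 'b', 'c'])
corr = plot_correlation_matrix(df)
print(corr.shape)  # → (3, 3)
```

Returning the correlation matrix alongside the plot makes the numbers easy to inspect programmatically, not just visually.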
Apply the techniques from Lesson 8:
```python
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Try multiple models
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(random_state=42)
}

for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring='f1')
    print(f"{name}: F1 = {scores.mean():.3f} (+/- {scores.std():.3f})")
```
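The cross-validation scores can also drive model selection programmatically rather than by eye. A self-contained sketch (the `make_classification` data is a synthetic stand-in for the real `X` and `y`):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the real dataset
X, y = make_classification(n_samples=300, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
}

# Keep the model with the highest mean cross-validated F1
cv_means = {name: cross_val_score(m, X_train, y_train, cv=5, scoring='f1').mean()
            for name, m in models.items()}
best_name = max(cv_means, key=cv_means.get)
best_model = models[best_name]
print(best_name, round(cv_means[best_name], 3))
```

Selecting on the cross-validated score (not the test set) keeps the test set untouched for the final, honest evaluation.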
```python
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report, roc_auc_score, roc_curve

# Train the best model
best_model.fit(X_train, y_train)
y_pred = best_model.predict(X_test)
y_proba = best_model.predict_proba(X_test)[:, 1]

# Comprehensive evaluation
print(classification_report(y_test, y_pred))
print(f"ROC AUC: {roc_auc_score(y_test, y_proba):.3f}")

# Plot ROC curve
fpr, tpr, _ = roc_curve(y_test, y_proba)
plt.plot(fpr, tpr, label=f'AUC = {roc_auc_score(y_test, y_proba):.3f}')
plt.plot([0, 1], [0, 1], 'k--')  # diagonal = a random classifier
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()
```
Present your findings clearly:
For production models:
```python
import joblib

# Save the model and the fitted scaler
joblib.dump(best_model, 'churn_model.pkl')
joblib.dump(scaler, 'scaler.pkl')

# Save feature list
with open('features.txt', 'w') as f:
    for feature in X_train.columns:
        f.write(f"{feature}\n")
```
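The serving side then loads the same artefacts and applies them in the same order. A self-contained round-trip sketch (the one-feature dataset and file names are illustrative):

```python
import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Train and save a tiny stand-in model
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
scaler = StandardScaler().fit(X)
model = LogisticRegression().fit(scaler.transform(X), y)
joblib.dump(model, 'churn_model.pkl')
joblib.dump(scaler, 'scaler.pkl')

# Later, in the serving code: load and predict on new data
loaded_model = joblib.load('churn_model.pkl')
loaded_scaler = joblib.load('scaler.pkl')
new_data = np.array([[2.5]])
pred = loaded_model.predict(loaded_scaler.transform(new_data))
print(pred[0])
```

Note that pickle-based artefacts are only safe to load from trusted sources, and should be loaded with the same library versions they were saved with.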
A well-organised project structure:
```
my-data-science-project/
├── data/
│   ├── raw/            # Original, immutable data
│   ├── processed/      # Cleaned, transformed data
│   └── external/       # Third-party data
├── notebooks/
│   ├── 01-eda.ipynb
│   ├── 02-cleaning.ipynb
│   ├── 03-modelling.ipynb
│   └── 04-evaluation.ipynb
├── src/
│   ├── data/           # Data loading and processing scripts
│   ├── features/       # Feature engineering scripts
│   ├── models/         # Model training and prediction scripts
│   └── visualisation/  # Plotting scripts
├── models/             # Saved trained models
├── reports/
│   └── figures/        # Generated graphics
├── requirements.txt    # Python dependencies
├── README.md           # Project overview
└── .gitignore
```
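The `.gitignore` at the root typically keeps large or regenerable artefacts out of version control. A minimal sketch (the exact entries depend on your project):

```
# Data and trained models are large and regenerable
data/
models/*.pkl

# Notebook and Python cruft
.ipynb_checkpoints/
__pycache__/
.venv/
```

Raw data stays immutable on disk (or in external storage), while the code that produces everything downstream is what gets versioned.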
Problem: Information from the test set leaks into the training process.
```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# WRONG — fitting scaler on all data
scaler.fit(X)  # Includes test data!
X_scaled = scaler.transform(X)

# RIGHT — fit on training data only
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```
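A scikit-learn `Pipeline` makes this mistake hard to commit, because preprocessing is re-fitted inside every cross-validation fold automatically. A self-contained sketch (synthetic data stands in for the real dataset):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real dataset
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

# The pipeline re-fits the scaler on each training fold, so the
# held-out fold never influences the scaling parameters
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=5, scoring='f1')
print(round(scores.mean(), 3))
```

The same pipeline object can then be fitted once on the full training set and saved as a single artefact, which also removes the need to persist the scaler separately.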
Problem: The model performs well on training data but poorly on new data.
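A quick way to detect overfitting is to compare training and test scores: a large gap signals memorisation. A self-contained sketch contrasting an unconstrained decision tree with a depth-limited one (synthetic data stands in for the real dataset):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# An unconstrained tree memorises the training data
deep = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Limiting depth regularises the model
shallow = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)

for name, m in [('unconstrained', deep), ('max_depth=3', shallow)]:
    gap = m.score(X_train, y_train) - m.score(X_test, y_test)
    print(f"{name}: train-test gap = {gap:.3f}")
```

The unconstrained tree scores perfectly on training data; a much smaller gap for the constrained model is the signature of better generalisation.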