Data Science Workflow and Next Steps

This final lesson ties together everything you have learned and provides a framework for approaching any data science project. We will cover the end-to-end data science workflow, best practices for project organisation, and the many directions you can take your learning from here.

The Data Science Workflow

Every data science project, regardless of domain or complexity, follows a similar lifecycle:

1. Problem Definition

"What question are we trying to answer?"

This is the most important step. A poorly defined problem leads to wasted effort.

Good question	Poor question
"Can we predict which customers will churn in the next 30 days?"	"Tell me something interesting about our customers"
"What factors most influence house prices in this region?"	"Analyse the housing data"
"Can we classify support tickets by urgency automatically?"	"Do some machine learning on our tickets"

Key considerations:

What decision will this analysis inform?
What does success look like? How will we measure it?
What data is available? What data would we need?
What are the constraints (time, budget, privacy)?

2. Data Collection

Gather data from all relevant sources:

# Combine multiple data sources
customers = pd.read_csv('customers.csv')
orders = pd.read_sql('SELECT * FROM orders', engine)
web_logs = pd.read_json('web_activity.json')

# Merge into a single dataset
df = customers.merge(orders, on='customer_id', how='left')
df = df.merge(web_logs, on='customer_id', how='left')

3. Data Cleaning and Preparation

Apply the techniques from Lesson 5:

# Missing values
df = handle_missing_values(df)

# Duplicates
df = df.drop_duplicates()

# Data types
df = fix_data_types(df)

# Outliers
df = handle_outliers(df)

# Feature engineering
df = create_features(df)

4. Exploratory Data Analysis

Apply the techniques from Lesson 7:

# Understand distributions
plot_distributions(df)

# Examine correlations
plot_correlation_matrix(df)

# Investigate relationships
plot_bivariate_analysis(df, target='churn')

5. Modelling

Apply the techniques from Lesson 8:

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Try multiple models
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(random_state=42)
}

for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring='f1')
    print(f"{name}: F1 = {scores.mean():.3f} (+/- {scores.std():.3f})")

6. Evaluation

from sklearn.metrics import classification_report, roc_auc_score, roc_curve

# Train the best model
best_model.fit(X_train, y_train)
y_pred = best_model.predict(X_test)
y_proba = best_model.predict_proba(X_test)[:, 1]

# Comprehensive evaluation
print(classification_report(y_test, y_pred))
print(f"ROC AUC: {roc_auc_score(y_test, y_proba):.3f}")

# Plot ROC curve
fpr, tpr, _ = roc_curve(y_test, y_proba)
plt.plot(fpr, tpr, label=f'AUC = {roc_auc_score(y_test, y_proba):.3f}')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()

7. Communication

Present your findings clearly:

Executive summary — one paragraph of key findings
Methodology — what you did and why
Results — visualisations and metrics
Recommendations — actionable next steps
Limitations — what the analysis does not cover

8. Deployment

For production models:

import joblib

# Save the model
joblib.dump(best_model, 'churn_model.pkl')
joblib.dump(scaler, 'scaler.pkl')

# Save feature list
with open('features.txt', 'w') as f:
    for feature in X_train.columns:
        f.write(f"{feature}\n")

Project Organisation

A well-organised project structure:

my-data-science-project/
├── data/
│   ├── raw/              # Original, immutable data
│   ├── processed/        # Cleaned, transformed data
│   └── external/         # Third-party data
├── notebooks/
│   ├── 01-eda.ipynb
│   ├── 02-cleaning.ipynb
│   ├── 03-modelling.ipynb
│   └── 04-evaluation.ipynb
├── src/
│   ├── data/             # Data loading and processing scripts
│   ├── features/         # Feature engineering scripts
│   ├── models/           # Model training and prediction scripts
│   └── visualisation/    # Plotting scripts
├── models/               # Saved trained models
├── reports/
│   └── figures/          # Generated graphics
├── requirements.txt      # Python dependencies
├── README.md             # Project overview
└── .gitignore

Common Pitfalls

1. Data Leakage

Problem: Information from the test set leaks into the training process.

# WRONG — fitting scaler on all data
scaler.fit(X)  # Includes test data!
X_scaled = scaler.transform(X)

# RIGHT — fit on training data only
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

2. Overfitting

Problem: The model performs well on training data but poorly on new data.

Data Science Workflow and Next Steps

Data Science Workflow and Next Steps

The Data Science Workflow

1. Problem Definition

2. Data Collection

3. Data Cleaning and Preparation

4. Exploratory Data Analysis

5. Modelling

6. Evaluation

7. Communication

8. Deployment

Project Organisation

Common Pitfalls

1. Data Leakage

2. Overfitting

More in Data Science