Data Cleaning & Transformation

Real-world data is messy. Missing values, inconsistent formats, duplicates, and outliers are the norm rather than the exception. Data engineers spend a significant portion of their time cleaning and transforming data before it can be used. This lesson covers the essential techniques.

The Data Quality Problem

Raw Data                    Clean Data
┌──────────────────┐       ┌──────────────────┐
│ Missing values   │       │ Complete records  │
│ Wrong types      │──────▶│ Correct types     │
│ Duplicates       │       │ No duplicates     │
│ Inconsistent fmt │       │ Standardised      │
│ Outliers         │       │ Validated ranges  │
└──────────────────┘       └──────────────────┘

Handling Missing Values

Detecting Missing Data

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "name": ["Alice", None, "Charlie", "Diana"],
    "age": [30, 25, None, 28],
    "email": ["alice@co.com", "bob@co.com", None, "diana@co.com"],
    "salary": [90000, np.nan, 105000, np.nan],
})

# Count missing values per column
print(df.isnull().sum())
# name      1
# age       1
# email     1
# salary    2

# Percentage missing
print((df.isnull().sum() / len(df) * 100).round(1))

Strategies for Missing Values

Strategy	When to Use	Code
Drop rows	Few missing, not critical	`df.dropna(subset=["email"])`
Fill with value	Known default makes sense	`df["salary"].fillna(0)`
Fill with mean	Numeric column, normal distribution	`df["salary"].fillna(df["salary"].mean())`
Forward fill	Time series data	`df["price"].ffill()`
Interpolate	Time series with trends	`df["price"].interpolate()`

# Combined approach
cleaned = (
    df
    .dropna(subset=["name", "email"])       # Must have name and email
    .assign(
        age=lambda x: x["age"].fillna(x["age"].median()),
        salary=lambda x: x["salary"].fillna(x["salary"].mean()),
    )
)

Data Type Conversion

Incorrect types are one of the most common data issues.

# Check current types
print(df.dtypes)

# Convert types explicitly
df = df.astype({
    "id": int,
    "price": float,
    "is_active": bool,
})

# Numeric conversion with error handling
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")  # invalid -> NaN

# Date conversion
df["created_at"] = pd.to_datetime(df["created_at"], format="%Y-%m-%d")

# Category type (saves memory for low-cardinality columns)
df["status"] = df["status"].astype("category")

Type Conversion Cheat Sheet

From	To	Method
`str`	`int`	`pd.to_numeric(s, errors="coerce")`
`str`	`float`	`pd.to_numeric(s, errors="coerce")`
`str`	`datetime`	`pd.to_datetime(s)`
`str`	`bool`	`s.map({"true": True, "false": False})`
`object`	`category`	`s.astype("category")`

String Operations

Pandas provides vectorised string operations through the .str accessor.

# Standardise email addresses
df["email"] = df["email"].str.lower().str.strip()

# Extract domain from email
df["domain"] = df["email"].str.split("@").str[1]

# Clean phone numbers — keep only digits
df["phone"] = df["phone"].str.replace(r"[^\d]", "", regex=True)

# Standardise names
df["name"] = df["name"].str.title().str.strip()

# Check for patterns
df["has_gmail"] = df["email"].str.contains("gmail", case=False, na=False)

Data Cleaning & Transformation

Data Cleaning & Transformation

The Data Quality Problem

Handling Missing Values

Detecting Missing Data

Strategies for Missing Values

Data Type Conversion

Type Conversion Cheat Sheet

String Operations

Common String Methods

More in Programming