You are viewing a free preview of this lesson.
Subscribe to unlock all 10 lessons in this course and every other course on LearningBro.
Real-world data is messy. Missing values, inconsistent formats, duplicates, and outliers are the norm rather than the exception. Data engineers spend a significant portion of their time cleaning and transforming data before it can be used. This lesson covers the essential techniques.
Raw Data Clean Data
┌──────────────────┐ ┌──────────────────┐
│ Missing values │ │ Complete records │
│ Wrong types │──────▶│ Correct types │
│ Duplicates │ │ No duplicates │
│ Inconsistent fmt │ │ Standardised │
│ Outliers │ │ Validated ranges │
└──────────────────┘ └──────────────────┘
import pandas as pd
import numpy as np
df = pd.DataFrame({
"name": ["Alice", None, "Charlie", "Diana"],
"age": [30, 25, None, 28],
"email": ["alice@co.com", "bob@co.com", None, "diana@co.com"],
"salary": [90000, np.nan, 105000, np.nan],
})
# Count missing values per column
print(df.isnull().sum())
# name 1
# age 1
# email 1
# salary 2
# Percentage missing
print((df.isnull().sum() / len(df) * 100).round(1))
| Strategy | When to Use | Code |
|---|---|---|
| Drop rows | Few missing, not critical | df.dropna(subset=["email"]) |
| Fill with value | Known default makes sense | df["salary"].fillna(0) |
| Fill with mean | Numeric column, normal distribution | df["salary"].fillna(df["salary"].mean()) |
| Forward fill | Time series data | df["price"].ffill() |
| Interpolate | Time series with trends | df["price"].interpolate() |
# Combined approach
cleaned = (
df
.dropna(subset=["name", "email"]) # Must have name and email
.assign(
age=lambda x: x["age"].fillna(x["age"].median()),
salary=lambda x: x["salary"].fillna(x["salary"].mean()),
)
)
Incorrect types are one of the most common data issues.
# Check current types
print(df.dtypes)
# Convert types explicitly
df = df.astype({
"id": int,
"price": float,
"is_active": bool,
})
# Numeric conversion with error handling
df["amount"] = pd.to_numeric(df["amount"], errors="coerce") # invalid -> NaN
# Date conversion
df["created_at"] = pd.to_datetime(df["created_at"], format="%Y-%m-%d")
# Category type (saves memory for low-cardinality columns)
df["status"] = df["status"].astype("category")
| From | To | Method |
|---|---|---|
str | int | pd.to_numeric(s, errors="coerce") |
str | float | pd.to_numeric(s, errors="coerce") |
str | datetime | pd.to_datetime(s) |
str | bool | s.map({"true": True, "false": False}) |
object | category | s.astype("category") |
Pandas provides vectorised string operations through the .str accessor.
# Standardise email addresses
df["email"] = df["email"].str.lower().str.strip()
# Extract domain from email
df["domain"] = df["email"].str.split("@").str[1]
# Clean phone numbers — keep only digits
df["phone"] = df["phone"].str.replace(r"[^\d]", "", regex=True)
# Standardise names
df["name"] = df["name"].str.title().str.strip()
# Check for patterns
df["has_gmail"] = df["email"].str.contains("gmail", case=False, na=False)
Subscribe to continue reading
Get full access to this lesson and all 10 lessons in this course.