Real-world data is messy. Missing values, inconsistent formats, duplicates, and outliers are the norm rather than the exception. Data engineers spend a significant portion of their time cleaning and transforming data before it can be used. This lesson covers the essential techniques.
      Raw Data                  Clean Data
┌──────────────────┐       ┌──────────────────┐
│ Missing values   │       │ Complete records │
│ Wrong types      │──────▶│ Correct types    │
│ Duplicates       │       │ No duplicates    │
│ Inconsistent fmt │       │ Standardised     │
│ Outliers         │       │ Validated ranges │
└──────────────────┘       └──────────────────┘
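To make the diagram concrete, here is a minimal sketch that applies one pandas step per row of the diagram. The `raw` records, the column names, and the 0 to 120 age range are hypothetical values chosen for illustration, not part of the lesson's dataset, and the order of the steps is one reasonable choice rather than a fixed recipe.

```python
import pandas as pd

# Hypothetical raw records, one problem per diagram row:
# stray whitespace and casing, numbers stored as text, a repeated row,
# an impossible age, and a missing name.
raw = pd.DataFrame({
    "name": [" alice ", "BOB", "BOB", "Charlie", None],
    "age": ["30", "25", "25", "410", "28"],
})

clean = (
    raw
    .dropna(subset=["name"])                      # missing values -> complete records
    .assign(
        # inconsistent formats -> standardised
        name=lambda d: d["name"].str.strip().str.title(),
        # wrong types -> correct types (unparseable text becomes NaN)
        age=lambda d: pd.to_numeric(d["age"], errors="coerce"),
    )
    .drop_duplicates()                            # duplicates -> no duplicates
    .loc[lambda d: d["age"].between(0, 120)]      # outliers -> validated ranges (assumed bounds)
    .reset_index(drop=True)
)

print(clean)
```

Standardising before de-duplicating matters here: rows that differ only in whitespace or casing are recognised as duplicates only after they have been normalised.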
import pandas as pd
import numpy as np
df = pd.DataFrame({
    "name": ["Alice", None, "Charlie", "Diana"],
    "age": [30, 25, None, 28],
    "email": ["alice@co.com", "bob@co.com", None, "diana@co.com"],
    "salary": [90000, np.nan, 105000, np.nan],
})
# Count missing values per column
print(df.isnull().sum())
# name 1
# age 1
# email 1
# salary 2
# Percentage of missing values per column
print((df.isnull().sum() / len(df) * 100).round(1))
# name 25.0
# age 25.0
# email 25.0
# salary 50.0
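Once the gaps are counted, the usual next decision is whether to drop the affected rows or fill them in. The following is a minimal sketch of both options on the same data. It assumes that `name` is a required field, that the median is an acceptable stand-in for missing numeric values, and that `"unknown"` is a suitable placeholder for a missing email; these are illustrative choices, and the right strategy depends on how the data will be used.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", None, "Charlie", "Diana"],
    "age": [30, 25, None, 28],
    "email": ["alice@co.com", "bob@co.com", None, "diana@co.com"],
    "salary": [90000, np.nan, 105000, np.nan],
})

# Option 1: drop rows that lack a required field
# (assumption: a record without a name is unusable).
required = df.dropna(subset=["name"])

# Option 2: fill gaps instead of dropping rows.
filled = df.copy()
# Numeric columns: use the column median (assumed acceptable here).
filled["age"] = filled["age"].fillna(filled["age"].median())
filled["salary"] = filled["salary"].fillna(filled["salary"].median())
# Text column: use an explicit placeholder (hypothetical choice).
filled["email"] = filled["email"].fillna("unknown")

print(len(required))           # rows remaining after the drop
print(filled.isnull().sum())   # only "name" still has a gap
```

Dropping is simple but discards the other fields in each affected row, while filling keeps every row at the cost of introducing estimated values, so it helps to record which values were imputed.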