Data cleaning is the process of detecting and correcting (or removing) inaccurate, incomplete, or inconsistent data. It is often said that data scientists spend 60-80% of their time cleaning data — and for good reason. The quality of your analysis and models depends entirely on the quality of your data. Garbage in, garbage out.
| Issue | Example |
|---|---|
| Missing values | Empty cells, NaN, None |
| Duplicates | Same record appearing multiple times |
| Inconsistent formatting | "London", "london", "LONDON" |
| Wrong data types | Numbers stored as strings |
| Outliers | A salary of $10,000,000 in a dataset of average incomes |
| Invalid values | Age = -5, date = "32/13/2024" |
| Inconsistent categories | "M", "Male", "male", "m" |
| Mixed units | Weights in both kg and lbs |
| Structural errors | Merged cells, extra whitespace, special characters |
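Several of the issues in the table (inconsistent formatting, inconsistent categories, extra whitespace) can be fixed with pandas string methods. A minimal sketch, using a made-up label column, that normalizes whitespace and case and then maps variants to canonical labels:

```python
import pandas as pd

# Hypothetical column with inconsistent category labels
s = pd.Series([' M ', 'Male', 'male', 'm', 'F', 'female'])

# Normalize: strip whitespace, lowercase, then map to canonical labels
cleaned = s.str.strip().str.lower().map({'m': 'Male', 'male': 'Male',
                                         'f': 'Female', 'female': 'Female'})
print(cleaned.tolist())  # ['Male', 'Male', 'Male', 'Male', 'Female', 'Female']
```

Any value not covered by the mapping becomes NaN, which conveniently flags categories you haven't accounted for.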
Let's work with a small example DataFrame that contains missing values:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', None, 'Diana', 'Eve'],
    'Age': [25, np.nan, 35, 28, np.nan],
    'City': ['London', 'Paris', 'Berlin', None, 'Madrid'],
    'Salary': [55000, 62000, np.nan, 51000, 48000]
})

# Check for missing values
print(df.isnull())              # Boolean mask
print(df.isnull().sum())        # Count per column
print(df.isnull().sum().sum())  # Total missing values

# Percentage of missing values per column
print(df.isnull().mean() * 100)
```
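A common follow-up is to use those percentages to decide which columns are too sparse to keep. A minimal sketch, with an arbitrary 30% cut-off (the sample DataFrame is redefined so the snippet runs on its own):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', None, 'Diana', 'Eve'],
    'Age': [25, np.nan, 35, 28, np.nan],
    'City': ['London', 'Paris', 'Berlin', None, 'Madrid'],
    'Salary': [55000, 62000, np.nan, 51000, 48000]
})

# Columns where more than 30% of values are missing (the cut-off is a judgment call)
pct_missing = df.isnull().mean() * 100
sparse_cols = pct_missing[pct_missing > 30].index.tolist()
print(sparse_cols)  # ['Age'] -- 2 of 5 values (40%) are missing
```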
```python
# Note: dropna returns a new DataFrame; reassign (df = df.dropna())
# to keep the result.

# Drop rows with any missing value
df.dropna()

# Drop rows where ALL values are missing
df.dropna(how='all')

# Drop rows with missing values in specific columns
df.dropna(subset=['Name', 'Salary'])

# Drop columns with missing values
df.dropna(axis=1)

# Drop rows with fewer than 3 non-null values
df.dropna(thresh=3)
```
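The `thresh` parameter is the least obvious of these. A quick sketch on the sample DataFrame (redefined here so the snippet runs on its own) shows what it keeps:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', None, 'Diana', 'Eve'],
    'Age': [25, np.nan, 35, 28, np.nan],
    'City': ['London', 'Paris', 'Berlin', None, 'Madrid'],
    'Salary': [55000, 62000, np.nan, 51000, 48000]
})

# thresh=4 keeps only rows with at least 4 non-null values
kept = df.dropna(thresh=4)
print(kept['Name'].tolist())  # ['Alice'] -- the only fully complete row
```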
```python
# Fill with a constant
df['City'].fillna('Unknown')

# Fill with the mean
df['Age'].fillna(df['Age'].mean())

# Fill with the median (better for skewed data)
df['Salary'].fillna(df['Salary'].median())

# Fill with the mode (most common value)
df['City'].fillna(df['City'].mode()[0])

# Forward fill (use the previous value)
df['Age'].ffill()

# Backward fill (use the next value)
df['Age'].bfill()

# Fill different columns with different strategies
df = df.fillna({
    'Age': df['Age'].median(),
    'City': 'Unknown',
    'Salary': df['Salary'].mean()
})
```
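One detail worth checking: a dict passed to `fillna` only touches the columns it names, so anything left out keeps its missing values. A quick sketch on the sample data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', None, 'Diana', 'Eve'],
    'Age': [25, np.nan, 35, 28, np.nan],
    'City': ['London', 'Paris', 'Berlin', None, 'Madrid'],
    'Salary': [55000, 62000, np.nan, 51000, 48000]
})

df = df.fillna({
    'Age': df['Age'].median(),
    'City': 'Unknown',
    'Salary': df['Salary'].mean()
})

# 'Name' was not in the mapping, so its missing value survives
print(df.isnull().sum().sum())  # 1
print(df.loc[3, 'City'])        # Unknown
```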
| Strategy | When to use |
|---|---|
| Drop | Small fraction of data is missing; missing values are random |
| Fill with mean/median | Numerical data; need to preserve dataset size |
| Fill with mode | Categorical data |
| Forward/backward fill | Time series data |
| Fill with domain knowledge | When you know what the value should be |
| Create a "missing" category | When missingness itself is informative |
```python
# Check for duplicates
df.duplicated()        # Boolean mask
df.duplicated().sum()  # Count of duplicate rows

# Show all rows involved in duplication
df[df.duplicated(keep=False)]

# Remove duplicates
df.drop_duplicates()

# Remove duplicates based on specific columns
df.drop_duplicates(subset=['Name', 'City'])

# Keep the last occurrence instead of the first
df.drop_duplicates(keep='last')
```
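The sample DataFrame above happens to have no duplicates, so here is a small made-up frame to see these methods in action:

```python
import pandas as pd

dup = pd.DataFrame({'Name': ['Alice', 'Bob', 'Alice', 'Alice'],
                    'City': ['London', 'Paris', 'London', 'Leeds']})

print(dup.duplicated().sum())                     # 1 (row 2 repeats row 0)
print(len(dup.drop_duplicates()))                 # 3
print(len(dup.drop_duplicates(subset=['Name'])))  # 2 (first Alice, Bob)
```

Note how `subset` changes the definition of "duplicate": the third Alice lives in a different city, so only `subset=['Name']` treats her row as a repeat.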
```python
# Check current data types
df.dtypes

# Convert to numeric (coerce unparseable values to NaN)
df['Age'] = pd.to_numeric(df['Age'], errors='coerce')

# The examples below assume columns such as 'PostCode', 'Date' and
# 'IsActive' exist in your DataFrame.

# Convert to string
df['PostCode'] = df['PostCode'].astype(str)

# Convert to datetime
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y', errors='coerce')

# Convert to category (saves memory for columns with few unique values)
df['City'] = df['City'].astype('category')

# Convert to boolean
df['IsActive'] = df['IsActive'].astype(bool)
```
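To see what `errors='coerce'` does in practice, a short sketch with a made-up column of strings:

```python
import pandas as pd

s = pd.Series(['42', '17', 'n/a', '3'])

# Unparseable entries become NaN instead of raising an error
nums = pd.to_numeric(s, errors='coerce')
print(nums.isnull().sum())  # 1 -- 'n/a' became NaN
print(nums.sum())           # 62.0
```

Without `errors='coerce'`, the 'n/a' entry would raise a `ValueError`; with it, the bad value is flagged as missing and can be handled with the techniques from earlier in this lesson.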