Data cleaning is the process of detecting and correcting (or removing) inaccurate, incomplete, or inconsistent data. It is often said that data scientists spend 60-80% of their time cleaning data, and for good reason: the quality of your analysis and models depends heavily on the quality of your data. Garbage in, garbage out.
| Issue | Example |
|---|---|
| Missing values | Empty cells, NaN, None |
| Duplicates | Same record appearing multiple times |
| Inconsistent formatting | "London", "london", "LONDON" |
| Wrong data types | Numbers stored as strings |
| Outliers | A salary of $10,000,000 in a dataset of average incomes |
| Invalid values | Age = -5, date = "32/13/2024" |
| Inconsistent categories | "M", "Male", "male", "m" |
| Mixed units | Weights in both kg and lbs |
| Structural errors | Merged cells, extra whitespace, special characters |
import pandas as pd
import numpy as np
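To make a few of the issues in the table concrete, here is a minimal sketch using pandas. The DataFrame, its column names (`city`, `age`, `gender`, `salary`), and the values are hypothetical and invented purely for illustration. The cleaning rules shown (treating negative ages as missing, mapping gender labels to `M`/`F`) are assumptions about what "clean" means for this toy data, not a general recipe.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data showing several issues from the table:
# inconsistent formatting, numbers stored as strings, an invalid age,
# inconsistent categories, a missing value, and a duplicated row.
df = pd.DataFrame({
    "city": ["London", "london ", "LONDON", "Paris", "Paris"],
    "age": ["34", "-5", "41", "29", "29"],
    "gender": ["M", "male", "Male", "F", "F"],
    "salary": [52000.0, np.nan, 61000.0, 48000.0, 48000.0],
})

# Inspect first: count missing values per column and exact duplicate rows.
missing_counts = df.isna().sum()
duplicate_count = int(df.duplicated().sum())

# Duplicates: keep the first occurrence of each repeated row.
clean = df.drop_duplicates().copy()

# Inconsistent formatting: trim whitespace and normalise capitalisation.
clean["city"] = clean["city"].str.strip().str.title()

# Wrong data types: convert strings to numbers; unparseable values become NaN.
clean["age"] = pd.to_numeric(clean["age"], errors="coerce")

# Invalid values: treat negative ages as missing (an assumed rule).
clean["age"] = clean["age"].mask(clean["age"] < 0)

# Inconsistent categories: map every variant to one label (assumed mapping).
gender_map = {"m": "M", "male": "M", "f": "F", "female": "F"}
clean["gender"] = clean["gender"].str.strip().str.lower().map(gender_map)

print(missing_counts)
print("Duplicate rows:", duplicate_count)
print(clean)
```

Note that the sketch inspects the data before changing anything. Counting missing values and duplicates first gives you a baseline, so you can tell afterwards how much each cleaning step actually altered.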