Bad data is worse than no data — it leads to wrong decisions, broken dashboards, and lost trust. Data quality must be validated at every stage of your pipeline. This lesson covers schema validation with Pydantic, data quality checks with Great Expectations, data contracts, and monitoring.
```
┌──────────────────────────────────────────────┐
│ Bad data entered at the source               │
│                      ↓                       │
│ Propagates through the pipeline              │
│                      ↓                       │
│ Loads into the warehouse                     │
│                      ↓                       │
│ Feeds into dashboards and ML models          │
│                      ↓                       │
│ Business makes wrong decisions               │
│                      ↓                       │
│ Hours or days to find and fix the root cause │
└──────────────────────────────────────────────┘
```
Rule of thumb: Fix data quality issues as early as possible — the cost of fixing grows exponentially as data moves downstream.
Pydantic validates data at runtime, catching type errors, missing fields, and constraint violations.
```python
from pydantic import BaseModel, Field, field_validator, EmailStr
from datetime import date
from typing import Optional

class CustomerRecord(BaseModel):
    id: int = Field(gt=0, description="Unique customer ID")
    name: str = Field(min_length=1, max_length=255)
    email: EmailStr
    signup_date: date
    total_orders: int = Field(ge=0, default=0)
    lifetime_value: float = Field(ge=0.0, default=0.0)
    country_code: Optional[str] = Field(default=None, pattern=r"^[A-Z]{2}$")

    @field_validator("name")
    @classmethod
    def name_must_not_be_empty(cls, v: str) -> str:
        if not v.strip():
            raise ValueError("Name must not be blank")
        return v.strip().title()
```
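To see the model in action, validating a single bad record raises a `ValidationError` that lists every failing field at once (the sample values below are made up):

```python
from pydantic import ValidationError

# Hypothetical bad record: negative ID, blank name, malformed email
try:
    CustomerRecord(id=-1, name="  ", email="not-an-email", signup_date="2024-01-15")
except ValidationError as e:
    for err in e.errors():
        print(err["loc"], err["msg"])
```

In a pipeline you usually validate whole DataFrames rather than single records; the helper below collects valid and invalid rows separately: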
```python
import pandas as pd
from pydantic import ValidationError

def validate_dataframe(
    df: pd.DataFrame,
    model: type[BaseModel],
) -> tuple[list[dict], list[dict]]:
    """Validate each row against a Pydantic model."""
    valid_rows = []
    invalid_rows = []
    for idx, row in df.iterrows():
        try:
            record = model(**row.to_dict())
            valid_rows.append(record.model_dump())
        except ValidationError as e:
            invalid_rows.append({
                "row_index": idx,
                "data": row.to_dict(),
                "errors": e.errors(),
            })
    return valid_rows, invalid_rows

# Usage
valid, invalid = validate_dataframe(raw_df, CustomerRecord)
print(f"Valid: {len(valid)}, Invalid: {len(invalid)}")
```
Great Expectations is a Python framework for defining and running data quality checks (called "expectations").
The snippet below uses the Great Expectations 1.x fluent API; older 0.x releases used a `Validator`-based API instead.

```python
import great_expectations as gx

# Create a context (ephemeral by default)
context = gx.get_context()

# Read the DataFrame as a batch using the built-in pandas data source
batch = context.data_sources.pandas_default.read_dataframe(df)

# Define expectations
suite = gx.ExpectationSuite(name="customers")
suite.add_expectation(gx.expectations.ExpectColumnToExist(column="email"))
suite.add_expectation(gx.expectations.ExpectColumnValuesToNotBeNull(column="email"))
suite.add_expectation(gx.expectations.ExpectColumnValuesToBeUnique(column="email"))
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToMatchRegex(
        column="email", regex=r"^[^@]+@[^@]+\.[^@]+$"
    )
)
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeBetween(
        column="age", min_value=0, max_value=150
    )
)
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeInSet(
        column="status", value_set=["active", "inactive", "suspended"]
    )
)

# Run and check results
results = batch.validate(suite)
print(f"Success: {results.success}")
```
| Expectation | What It Checks |
|---|---|
| `expect_column_to_exist` | Column is present |
| `expect_column_values_to_not_be_null` | No null values |
| `expect_column_values_to_be_unique` | All values are distinct |
| `expect_column_values_to_be_between` | Values fall within a numeric range |
| `expect_column_values_to_be_in_set` | Values come from an allowed set |
| `expect_column_values_to_match_regex` | Values match a pattern |
| `expect_table_row_count_to_be_between` | Table has the expected number of rows |
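The last row in the table deserves special mention: a row-count check is a cheap guard against an upstream job that silently delivers an empty or truncated load. Continuing the suite from the sketch above, with illustrative bounds:

```python
# Guard against empty or truncated loads; the bounds are illustrative
suite.add_expectation(
    gx.expectations.ExpectTableRowCountToBeBetween(min_value=1_000, max_value=1_000_000)
)
```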
A data contract is a formal agreement between data producers and consumers about the schema, quality, and semantics of the data.
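Contracts are usually written as YAML or JSON documents checked into version control and enforced in CI. As a minimal sketch, the dictionary below shows the kinds of fields a contract typically pins down; every name and value here is illustrative, not a standard:

```python
# An illustrative data contract for the customers dataset (all names hypothetical)
customer_contract = {
    "dataset": "analytics.customers",
    "owner": "data-platform-team",
    "version": "1.2.0",
    "schema": {
        "id": {"type": "integer", "nullable": False, "unique": True},
        "email": {"type": "string", "nullable": False, "unique": True},
        "signup_date": {"type": "date", "nullable": False},
        "country_code": {"type": "string", "nullable": True, "pattern": "^[A-Z]{2}$"},
    },
    "quality": {
        "freshness_sla_hours": 24,        # data must land within 24h of the source event
        "max_null_rate": {"email": 0.0},  # hard cap on the null fraction per column
        "min_row_count": 1,               # guard against empty loads
    },
    "semantics": {
        "lifetime_value": "Gross revenue attributed to the customer, in USD",
    },
}
```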