Bad data is worse than no data — it leads to wrong decisions, broken dashboards, and lost trust. Data quality must be validated at every stage of your pipeline. This lesson covers schema validation with Pydantic, data quality checks with Great Expectations, data contracts, and monitoring.
```
┌──────────────────────────────────────────────┐
│ Bad data entered at the source               │
│                      ↓                       │
│ Propagates through the pipeline              │
│                      ↓                       │
│ Loads into the warehouse                     │
│                      ↓                       │
│ Feeds into dashboards and ML models          │
│                      ↓                       │
│ Business makes wrong decisions               │
│                      ↓                       │
│ Hours or days to find and fix the root cause │
└──────────────────────────────────────────────┘
```
Rule of thumb: Fix data quality issues as early as possible — the cost of fixing grows exponentially as data moves downstream.
Pydantic validates data at runtime, catching type errors, missing fields, and constraint violations.
```python
from pydantic import BaseModel, Field, field_validator, EmailStr
from datetime import date
from typing import Optional

class CustomerRecord(BaseModel):
    id: int = Field(gt=0, description="Unique customer ID")
    name: str = Field(min_length=1, max_length=255)
    email: EmailStr
    signup_date: date
    total_orders: int = Field(ge=0, default=0)
    lifetime_value: float = Field(ge=0.0, default=0.0)
    country_code: Optional[str] = Field(default=None, pattern=r"^[A-Z]{2}$")

    @field_validator("name")
    @classmethod
    def name_must_not_be_empty(cls, v: str) -> str:
        if not v.strip():
            raise ValueError("Name must not be blank")
        return v.strip().title()
```
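To see the model in action, validating a single bad record raises a `ValidationError` that lists every failing field at once (the sample values below are made up):

```python
from pydantic import ValidationError

# Hypothetical bad record: negative ID, blank name, malformed email
try:
    CustomerRecord(id=-1, name="  ", email="not-an-email", signup_date="2024-01-15")
except ValidationError as e:
    for err in e.errors():
        print(err["loc"], err["msg"])
```

In a pipeline you usually validate whole DataFrames rather than single records; the helper below collects valid and invalid rows separately: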
```python
import pandas as pd
from pydantic import ValidationError

def validate_dataframe(
    df: pd.DataFrame,
    model: type[BaseModel],
) -> tuple[list[dict], list[dict]]:
    """Validate each row against a Pydantic model."""
    valid_rows = []
    invalid_rows = []
    for idx, row in df.iterrows():
        try:
            record = model(**row.to_dict())
            valid_rows.append(record.model_dump())
        except ValidationError as e:
            invalid_rows.append({
                "row_index": idx,
                "data": row.to_dict(),
                "errors": e.errors(),
            })
    return valid_rows, invalid_rows

# Usage
valid, invalid = validate_dataframe(raw_df, CustomerRecord)
print(f"Valid: {len(valid)}, Invalid: {len(invalid)}")
```
Great Expectations is a Python framework for defining and running data quality checks (called "expectations").
The snippet below uses the Great Expectations 1.x fluent API; older 0.x releases used a `Validator`-based API instead.

```python
import great_expectations as gx

# Create a context (ephemeral by default)
context = gx.get_context()

# Read the DataFrame as a batch using the built-in pandas data source
batch = context.data_sources.pandas_default.read_dataframe(df)

# Define expectations
suite = gx.ExpectationSuite(name="customers")
suite.add_expectation(gx.expectations.ExpectColumnToExist(column="email"))
suite.add_expectation(gx.expectations.ExpectColumnValuesToNotBeNull(column="email"))
suite.add_expectation(gx.expectations.ExpectColumnValuesToBeUnique(column="email"))
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToMatchRegex(
        column="email", regex=r"^[^@]+@[^@]+\.[^@]+$"
    )
)
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeBetween(
        column="age", min_value=0, max_value=150
    )
)
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeInSet(
        column="status", value_set=["active", "inactive", "suspended"]
    )
)

# Run and check results
results = batch.validate(suite)
print(f"Success: {results.success}")
```
| Expectation | What It Checks |
|---|---|
| `expect_column_to_exist` | Column is present |
| `expect_column_values_to_not_be_null` | No null values |
| `expect_column_values_to_be_unique` | All values are distinct |
| `expect_column_values_to_be_between` | Values fall within a numeric range |
| `expect_column_values_to_be_in_set` | Values come from an allowed set |
| `expect_column_values_to_match_regex` | Values match a pattern |
| `expect_table_row_count_to_be_between` | Table has the expected number of rows |
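The last row in the table deserves special mention: a row-count check is a cheap guard against an upstream job that silently delivers an empty or truncated load. Continuing the suite from the sketch above, with illustrative bounds:

```python
# Guard against empty or truncated loads; the bounds are illustrative
suite.add_expectation(
    gx.expectations.ExpectTableRowCountToBeBetween(min_value=1_000, max_value=1_000_000)
)
```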
A data contract is a formal agreement between data producers and consumers about the schema, quality, and semantics of the data.
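Contracts are usually written as YAML or JSON documents checked into version control and enforced in CI. As a minimal sketch, the dictionary below shows the kinds of fields a contract typically pins down; every name and value here is illustrative, not a standard:

```python
# An illustrative data contract for the customers dataset (all names hypothetical)
customer_contract = {
    "dataset": "analytics.customers",
    "owner": "data-platform-team",
    "version": "1.2.0",
    "schema": {
        "id": {"type": "integer", "nullable": False, "unique": True},
        "email": {"type": "string", "nullable": False, "unique": True},
        "signup_date": {"type": "date", "nullable": False},
        "country_code": {"type": "string", "nullable": True, "pattern": "^[A-Z]{2}$"},
    },
    "quality": {
        "freshness_sla_hours": 24,        # data must land within 24h of the source event
        "max_null_rate": {"email": 0.0},  # hard cap on the null fraction per column
        "min_row_count": 1,               # guard against empty loads
    },
    "semantics": {
        "lifetime_value": "Gross revenue attributed to the customer, in USD",
    },
}
```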