
Python Data Engineering Fundamentals

Data engineering is the discipline of designing, building, and maintaining systems that collect, store, and transform data for analysis and machine learning. Python has become the dominant language in this space thanks to its rich ecosystem and readability. This lesson covers the Python ecosystem for data engineering, key libraries, project setup, and how to use type hints effectively in data code.


Why Python for Data Engineering?

Python dominates data engineering for several reasons:

  • Rich ecosystem — hundreds of mature libraries for every data task
  • Readability — data pipelines are read far more than they are written
  • Community — massive community means excellent documentation and support
  • Interoperability — bridges nicely to SQL, Spark, cloud SDKs, and REST APIs
  • Rapid prototyping — go from idea to working pipeline quickly

The Data Engineering Landscape

┌────────────────────────────────────────────────────────┐
│                   DATA SOURCES                         │
│  APIs  │  Databases  │  Files  │  Streams  │  Webhooks │
└────────┬───────────────────────────────────┬───────────┘
         │                                   │
         ▼                                   ▼
┌──────────────────┐             ┌───────────────────────┐
│     EXTRACT      │             │     ORCHESTRATION     │
│  requests        │             │  Airflow / Prefect    │
│  sqlalchemy      │             │  scheduling, retries  │
│  boto3           │             └───────────────────────┘
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│    TRANSFORM     │
│  pandas / polars │
│  pydantic        │
│  great_expect.   │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│      LOAD        │
│  sqlalchemy      │
│  psycopg2        │
│  pyarrow         │
└──────────────────┘
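The three stages in the diagram can be sketched as plain functions. This is an illustrative shape only, using hypothetical in-memory data rather than real sources; the library boxes above show what each stage would use in practice.

```python
# Minimal extract -> transform -> load sketch of the pipeline shape above.
# The data and function names are illustrative, not a real pipeline.

def extract() -> list[dict]:
    """Stand-in for pulling rows from an API or database."""
    return [
        {"email": " Alice@Example.com ", "orders": 3},
        {"email": None, "orders": 1},  # bad row: missing email
        {"email": "bob@example.com", "orders": 5},
    ]

def transform(rows: list[dict]) -> list[dict]:
    """Drop invalid rows and normalise fields."""
    return [
        {"email": r["email"].strip().lower(), "orders": r["orders"]}
        for r in rows
        if r["email"] is not None
    ]

def load(rows: list[dict]) -> int:
    """Stand-in for writing to a warehouse; returns rows written."""
    return len(rows)

if __name__ == "__main__":
    print(load(transform(extract())))  # 2 valid rows survive
```

An orchestrator like Airflow or Prefect then schedules and retries these stages; the function boundaries are what make that wiring straightforward.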

Key Libraries Overview

Library             Purpose                               Install Command
pandas              Data manipulation and analysis        pip install pandas
polars              Fast DataFrame library (Rust-backed)  pip install polars
sqlalchemy          Database toolkit and ORM              pip install sqlalchemy
pydantic            Data validation and settings          pip install pydantic
pyarrow             Parquet and Arrow file support        pip install pyarrow
great_expectations  Data quality validation               pip install great_expectations
requests            HTTP client for API extraction        pip install requests
aiohttp             Async HTTP client                     pip install aiohttp
prefect             Workflow orchestration                pip install prefect
pytest              Testing framework                     pip install pytest

Project Setup

Recommended Project Structure

data-pipeline/
├── .env                    # secrets (git-ignored)
├── .gitignore
├── pyproject.toml          # dependencies and project metadata
├── src/
│   ├── __init__.py
│   ├── extract/
│   │   ├── __init__.py
│   │   ├── api_client.py   # API extraction logic
│   │   └── db_reader.py    # Database extraction
│   ├── transform/
│   │   ├── __init__.py
│   │   ├── clean.py        # Data cleaning functions
│   │   └── enrich.py       # Data enrichment
│   ├── load/
│   │   ├── __init__.py
│   │   └── warehouse.py    # Load to destination
│   ├── models/
│   │   ├── __init__.py
│   │   └── schemas.py      # Pydantic models
│   └── pipeline.py         # Main pipeline orchestration
├── tests/
│   ├── __init__.py
│   ├── test_extract.py
│   ├── test_transform.py
│   └── fixtures/
│       └── sample_data.csv
├── dags/                   # Airflow DAGs (if using)
│   └── daily_pipeline.py
└── README.md

Setting Up with pyproject.toml

[project]
name = "data-pipeline"
version = "0.1.0"
requires-python = ">=3.11"

dependencies = [
    "pandas>=2.0",
    "sqlalchemy>=2.0",
    "pydantic>=2.0",
    "pyarrow>=14.0",
    "python-dotenv>=1.0",
    "requests>=2.31",
]

[project.optional-dependencies]
dev = [
    "pytest>=7.0",
    "ruff>=0.1.0",
    "mypy>=1.5",
]

Creating Your Environment

# Create and activate a virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Install the project in editable mode
pip install -e ".[dev]"

Type Hints for Data Code

Type hints make data pipelines far more maintainable. They catch bugs early, improve IDE support, and serve as living documentation.

Basic Type Hints

def calculate_revenue(
    price: float,
    quantity: int,
    discount: float | None = None,
) -> float:
    """Calculate revenue with optional discount."""
    total = price * quantity
    if discount is not None:
        total *= (1 - discount)
    return total

Type Hints with Pandas

import pandas as pd

def clean_customer_data(df: pd.DataFrame) -> pd.DataFrame:
    """Clean and standardise customer data."""
    return (
        df
        .dropna(subset=["email"])
        .assign(
            email=lambda x: x["email"].str.lower().str.strip(),
            name=lambda x: x["name"].str.title(),
        )
    )

def load_csv(path: str) -> pd.DataFrame:
    """Load a CSV file into a DataFrame."""
    return pd.read_csv(path)
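Here is what `clean_customer_data` does to a small in-memory DataFrame (the sample rows are made up for illustration): the row with a missing email is dropped first, then the surviving values are normalised.

```python
import pandas as pd

def clean_customer_data(df: pd.DataFrame) -> pd.DataFrame:
    """Clean and standardise customer data (as defined above)."""
    return (
        df
        .dropna(subset=["email"])
        .assign(
            email=lambda x: x["email"].str.lower().str.strip(),
            name=lambda x: x["name"].str.title(),
        )
    )

df = pd.DataFrame({
    "name": ["alice smith", "bob jones"],
    "email": [" Alice@Example.COM ", None],  # second row will be dropped
})
cleaned = clean_customer_data(df)
print(cleaned)  # one row: name "Alice Smith", email "alice@example.com"
```

Note the ordering matters: `dropna` runs before `assign`, so the `.str` methods never see the missing value.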

TypedDict for Row Schemas

from typing import TypedDict

class CustomerRow(TypedDict):
    id: int
    name: str
    email: str
    signup_date: str
    total_orders: int

def process_customer(row: CustomerRow) -> dict[str, str | int]:
    """Process a single customer row."""
    return {
        "id": row["id"],
        "display_name": row["name"].upper(),
        "order_count": row["total_orders"],
    }
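One thing worth emphasising: a TypedDict exists only for static analysis. At runtime it is an ordinary dict, so a wrong value gets flagged by mypy but runs without complaint. This small sketch (with a trimmed-down, hypothetical schema) shows the behaviour:

```python
from typing import TypedDict

class CustomerRow(TypedDict):
    id: int
    name: str

# TypedDict checks happen only during static analysis;
# at runtime this is just a plain dict.
row: CustomerRow = {"id": 1, "name": "Alice"}
print(type(row))  # <class 'dict'>

# The next line would be a mypy error, but would run without any exception:
# bad: CustomerRow = {"id": "oops", "name": "Alice"}
```

When you need the check to happen at runtime, reach for Pydantic, covered next.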

Pydantic Models for Validation

from pydantic import BaseModel, EmailStr  # EmailStr needs: pip install "pydantic[email]"
from datetime import date

class Customer(BaseModel):
    id: int
    name: str
    email: EmailStr
    signup_date: date
    total_orders: int = 0

# Validates at runtime
customer = Customer(
    id=1,
    name="Alice Smith",
    email="alice@example.com",
    signup_date="2024-01-15",
    total_orders=42,
)
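The flip side of runtime validation is that bad data raises a `ValidationError` describing every failing field at once. A minimal sketch (using a trimmed-down model without `EmailStr`, so no extra install is needed):

```python
from datetime import date
from pydantic import BaseModel, ValidationError

class Customer(BaseModel):
    id: int
    name: str
    signup_date: date
    total_orders: int = 0

try:
    Customer(id="not-a-number", name="Bob", signup_date="2024-13-99")
except ValidationError as e:
    # Both the non-numeric id and the impossible date are reported together
    errors = e.errors()
    print(len(errors))  # 2
```

In a pipeline, catching `ValidationError` per record lets you route bad rows to a dead-letter table instead of crashing the whole run.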

Configuring Your Environment

Environment Variables with python-dotenv

from dotenv import load_dotenv
import os

load_dotenv()

DATABASE_URL = os.getenv("DATABASE_URL", "sqlite:///local.db")
API_KEY = os.getenv("API_KEY")
BATCH_SIZE = int(os.getenv("BATCH_SIZE", "1000"))

Configuration with Pydantic Settings

# pydantic-settings is a separate package: pip install pydantic-settings
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env")

    database_url: str = "sqlite:///local.db"
    api_key: str
    batch_size: int = 1000
    debug: bool = False

settings = Settings()
print(settings.database_url)

Running Type Checks

# Install mypy
pip install mypy

# Run type checking
mypy src/ --ignore-missing-imports

# For stricter checking
mypy src/ --strict --ignore-missing-imports

Example mypy Output

src/transform/clean.py:15: error: Argument "price" to "calculate_revenue"
    has incompatible type "str"; expected "float"
Found 1 error in 1 file (checked 5 source files)

Summary

  • Python is the dominant language for data engineering thanks to its rich ecosystem and readability.
  • Key libraries include pandas, SQLAlchemy, Pydantic, PyArrow, and pytest.
  • Organise projects with a clear ETL directory structure separating extract, transform, and load.
  • Use type hints extensively — they serve as documentation, catch bugs, and improve IDE support.
  • Use Pydantic for runtime data validation and mypy for static type checking.
  • Configure environments with python-dotenv or Pydantic Settings to manage secrets safely.