Python Data Engineering Fundamentals
Data engineering is the discipline of designing, building, and maintaining systems that collect, store, and transform data for analysis and machine learning. Python has become the dominant language in this space thanks to its rich ecosystem and readability. This lesson covers the Python ecosystem for data engineering, key libraries, project setup, and how to use type hints effectively in data code.
Why Python for Data Engineering?
Python dominates data engineering for several reasons:
- Rich ecosystem — hundreds of mature libraries for every data task
- Readability — data pipelines are read far more than they are written
- Community — massive community means excellent documentation and support
- Interoperability — bridges nicely to SQL, Spark, cloud SDKs, and REST APIs
- Rapid prototyping — go from idea to working pipeline quickly
The Data Engineering Landscape
┌────────────────────────────────────────────────────────┐
│ DATA SOURCES │
│ APIs │ Databases │ Files │ Streams │ Webhooks │
└────────┬───────────────────────────────────┬───────────┘
│ │
▼ ▼
┌─────────────────┐ ┌───────────────────────┐
│ EXTRACT │ │ ORCHESTRATION │
│ requests │ │ Airflow / Prefect │
│ sqlalchemy │ │ scheduling, retries │
│ boto3 │ └───────────────────────┘
└────────┬────────┘
│
▼
┌─────────────────┐
│ TRANSFORM │
│ pandas / polars │
│ pydantic │
│ great_expect. │
└────────┬────────┘
│
▼
┌─────────────────┐
│ LOAD │
│ sqlalchemy │
│ psycopg2 │
│ pyarrow │
└─────────────────┘
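The extract → transform → load flow in the diagram can be sketched end to end with nothing but the standard library. This is a minimal illustration, not a production pattern: the inline CSV stands in for an API or database source, and an in-memory SQLite table stands in for a warehouse.

```python
import csv
import io
import sqlite3

def extract(raw: str) -> list[dict[str, str]]:
    """Extract: in a real pipeline this might be requests.get(...) or a
    SQLAlchemy query; here we parse an inline CSV to keep the sketch runnable."""
    return list(csv.DictReader(io.StringIO(raw)))

def transform(rows: list[dict[str, str]]) -> list[dict[str, str]]:
    """Transform: normalise fields and drop rows without an email."""
    return [
        {"name": r["name"].title(), "email": r["email"].lower().strip()}
        for r in rows
        if r["email"].strip()
    ]

def load(rows: list[dict[str, str]], conn: sqlite3.Connection) -> int:
    """Load: write to a destination table and return the row count."""
    conn.execute("CREATE TABLE IF NOT EXISTS customers (name TEXT, email TEXT)")
    conn.executemany("INSERT INTO customers VALUES (:name, :email)", rows)
    return conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]

raw = "name,email\nada lovelace, ADA@EXAMPLE.COM \ngrace hopper,\n"
conn = sqlite3.connect(":memory:")
loaded = load(transform(extract(raw)), conn)
print(loaded)  # 1 — the row with a missing email was filtered out
```

The orchestration box in the diagram is what schedules and retries a function chain like this; the chain itself stays plain Python.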
Key Libraries Overview
| Library | Purpose | Install Command |
|---|---|---|
| pandas | Data manipulation and analysis | pip install pandas |
| polars | Fast DataFrame library (Rust-backed) | pip install polars |
| sqlalchemy | Database toolkit and ORM | pip install sqlalchemy |
| pydantic | Data validation and settings | pip install pydantic |
| pyarrow | Parquet and Arrow file support | pip install pyarrow |
| great_expectations | Data quality validation | pip install great_expectations |
| requests | HTTP client for API extraction | pip install requests |
| aiohttp | Async HTTP client | pip install aiohttp |
| prefect | Workflow orchestration | pip install prefect |
| pytest | Testing framework | pip install pytest |
Project Setup
Recommended Project Structure
data-pipeline/
├── .env # secrets (git-ignored)
├── .gitignore
├── pyproject.toml # dependencies and project metadata
├── src/
│ ├── __init__.py
│ ├── extract/
│ │ ├── __init__.py
│ │ ├── api_client.py # API extraction logic
│ │ └── db_reader.py # Database extraction
│ ├── transform/
│ │ ├── __init__.py
│ │ ├── clean.py # Data cleaning functions
│ │ └── enrich.py # Data enrichment
│ ├── load/
│ │ ├── __init__.py
│ │ └── warehouse.py # Load to destination
│ ├── models/
│ │ ├── __init__.py
│ │ └── schemas.py # Pydantic models
│ └── pipeline.py # Main pipeline orchestration
├── tests/
│ ├── __init__.py
│ ├── test_extract.py
│ ├── test_transform.py
│ └── fixtures/
│ └── sample_data.csv
├── dags/ # Airflow DAGs (if using)
│ └── daily_pipeline.py
└── README.md
Setting Up with pyproject.toml
[project]
name = "data-pipeline"
version = "0.1.0"
requires-python = ">=3.11"
dependencies = [
    "pandas>=2.0",
    "sqlalchemy>=2.0",
    "pydantic>=2.0",
    "pyarrow>=14.0",
    "python-dotenv>=1.0",
    "requests>=2.31",
]

[project.optional-dependencies]
dev = [
    "pytest>=7.0",
    "ruff>=0.1.0",
    "mypy>=1.5",
]
Creating Your Environment
# Create and activate a virtual environment
python3 -m venv .venv
source .venv/bin/activate
# Install the project in editable mode
pip install -e ".[dev]"
Type Hints for Data Code
Type hints make data pipelines far more maintainable. They catch bugs early, improve IDE support, and serve as living documentation.
Basic Type Hints
def calculate_revenue(
    price: float,
    quantity: int,
    discount: float | None = None,
) -> float:
    """Calculate revenue with optional discount."""
    total = price * quantity
    if discount is not None:
        total *= (1 - discount)
    return total
Type Hints with Pandas
import pandas as pd
def clean_customer_data(df: pd.DataFrame) -> pd.DataFrame:
    """Clean and standardise customer data."""
    return (
        df
        .dropna(subset=["email"])
        .assign(
            email=lambda x: x["email"].str.lower().str.strip(),
            name=lambda x: x["name"].str.title(),
        )
    )

def load_csv(path: str) -> pd.DataFrame:
    """Load a CSV file into a DataFrame."""
    return pd.read_csv(path)
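To sanity-check that cleaning chain, the same steps can be run against a small inline sample (the customer data here is invented):

```python
import io

import pandas as pd

sample = io.StringIO(
    "name,email\n"
    "alice smith,  ALICE@EXAMPLE.COM \n"
    "bob jones,\n"
)
df = pd.read_csv(sample)

cleaned = (
    df
    .dropna(subset=["email"])  # bob has no email, so his row is dropped
    .assign(
        email=lambda x: x["email"].str.lower().str.strip(),
        name=lambda x: x["name"].str.title(),
    )
)
print(cleaned.to_dict("records"))
# [{'name': 'Alice Smith', 'email': 'alice@example.com'}]
```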
TypedDict for Row Schemas
from typing import TypedDict
class CustomerRow(TypedDict):
    id: int
    name: str
    email: str
    signup_date: str
    total_orders: int

def process_customer(row: CustomerRow) -> dict[str, str | int]:
    """Process a single customer row."""
    return {
        "id": row["id"],
        "display_name": row["name"].upper(),
        "order_count": row["total_orders"],
    }
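One caveat worth knowing: a TypedDict is checked only by static tools such as mypy, never at runtime. That gap is exactly what Pydantic (next section) fills. A small demonstration:

```python
from typing import TypedDict

class CustomerRow(TypedDict):
    id: int
    name: str

# At runtime a TypedDict is just a plain dict — nothing stops bad data.
bad: CustomerRow = {"id": "not-an-int", "name": 123}  # type: ignore[typeddict-item]
print(type(bad))  # <class 'dict'> — no error raised

# mypy, by contrast, flags this assignment as a type error before the code runs.
```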
Pydantic Models for Validation
from pydantic import BaseModel, EmailStr  # EmailStr needs: pip install "pydantic[email]"
from datetime import date
class Customer(BaseModel):
    id: int
    name: str
    email: EmailStr
    signup_date: date
    total_orders: int = 0

# Validates at runtime
customer = Customer(
    id=1,
    name="Alice Smith",
    email="alice@example.com",
    signup_date="2024-01-15",
    total_orders=42,
)
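Unlike a TypedDict, a Pydantic model actively coerces and rejects data as it is constructed. The sketch below (which drops EmailStr to avoid the optional email-validator dependency) shows both sides, assuming Pydantic v2:

```python
from datetime import date

from pydantic import BaseModel, ValidationError

class Customer(BaseModel):
    id: int
    name: str
    signup_date: date

# Valid input: strings are coerced to the declared types.
ok = Customer(id="1", name="Alice", signup_date="2024-01-15")
print(ok.id, ok.signup_date)  # 1 2024-01-15

# Invalid input raises ValidationError with per-field details.
try:
    Customer(id="not-a-number", name="Bob", signup_date="2024-01-15")
except ValidationError as exc:
    n_errors = exc.error_count()
print(n_errors)  # 1 — only the id field failed
```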
Configuring Your Environment
Environment Variables with python-dotenv
from dotenv import load_dotenv
import os
load_dotenv()
DATABASE_URL = os.getenv("DATABASE_URL", "sqlite:///local.db")
API_KEY = os.getenv("API_KEY")
BATCH_SIZE = int(os.getenv("BATCH_SIZE", "1000"))
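A matching .env file might look like the following (placeholder values only; keep the file git-ignored, as in the project structure above):

```
DATABASE_URL=postgresql://user:pass@localhost:5432/warehouse
API_KEY=replace-me
BATCH_SIZE=500
```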
Configuration with Pydantic Settings
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    # Pydantic v2 style: the nested `class Config` used in v1 is deprecated.
    model_config = SettingsConfigDict(env_file=".env")

    database_url: str = "sqlite:///local.db"
    api_key: str
    batch_size: int = 1000
    debug: bool = False
settings = Settings()
print(settings.database_url)
Running Type Checks
# Install mypy
pip install mypy
# Run type checking
mypy src/ --ignore-missing-imports
# For stricter checking
mypy src/ --strict --ignore-missing-imports
Example mypy Output
src/transform/clean.py:15: error: Argument "price" to "calculate_revenue"
has incompatible type "str"; expected "float"
Found 1 error in 1 file (checked 5 source files)
Summary
- Python is the dominant language for data engineering thanks to its rich ecosystem and readability.
- Key libraries include pandas, SQLAlchemy, Pydantic, PyArrow, and pytest.
- Organise projects with a clear ETL directory structure separating extract, transform, and load.
- Use type hints extensively — they serve as documentation, catch bugs, and improve IDE support.
- Use Pydantic for runtime data validation and mypy for static type checking.
- Configure environments with python-dotenv or Pydantic Settings to manage secrets safely.