ETL (Extract-Transform-Load) is the backbone of data engineering. An ETL pipeline extracts data from one or more sources, transforms it into a usable format, and loads it into a destination. This lesson covers pipeline architecture, idempotency, incremental loads, and error handling.
┌──────────────┐     ┌────────────────┐     ┌──────────────┐
│   EXTRACT    │────▶│   TRANSFORM    │────▶│     LOAD     │
│              │     │                │     │              │
│ - APIs       │     │ - Clean        │     │ - Database   │
│ - Databases  │     │ - Validate     │     │ - Data Lake  │
│ - Files      │     │ - Enrich       │     │ - Warehouse  │
│ - Streams    │     │ - Aggregate    │     │ - Files      │
└──────────────┘     └────────────────┘     └──────────────┘
import pandas as pd
import requests
from sqlalchemy import create_engine
from datetime import datetime, timezone


def extract(api_url: str) -> pd.DataFrame:
    """Extract data from an API endpoint."""
    response = requests.get(api_url, timeout=30)
    response.raise_for_status()  # fail fast on 4xx/5xx responses
    return pd.DataFrame(response.json()["data"])


def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Clean and transform the extracted data."""
    return (
        df
        .dropna(subset=["email", "name"])  # drop rows missing required fields
        .assign(
            # Normalize emails so duplicates compare equal
            email=lambda x: x["email"].str.lower().str.strip(),
            name=lambda x: x["name"].str.title(),
            # Timezone-aware UTC timestamp; datetime.utcnow() is deprecated since Python 3.12
            extracted_at=datetime.now(timezone.utc),
        )
        .drop_duplicates(subset=["email"])  # keep the first row per email
    )
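
The introduction names idempotency as one of the lesson's topics. As a rough illustration only, the sketch below shows one way a load stage could be made idempotent: rows are upserted on a key, so running the same load twice leaves the table unchanged. Everything here is an assumption made for the example rather than the lesson's own load step: the function name `load`, the `users` table, the choice of `email` as the key, the sample rows, and the use of the standard-library `sqlite3` module in place of a SQLAlchemy engine.

```python
import sqlite3


def load(records, conn):
    """Upsert records keyed by email (hypothetical schema).

    Returns the number of rows in the table after the load.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users ("
        "email TEXT PRIMARY KEY, name TEXT, extracted_at TEXT)"
    )
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany(
            # INSERT OR REPLACE overwrites any existing row with the same key
            "INSERT OR REPLACE INTO users (email, name, extracted_at) "
            "VALUES (:email, :name, :extracted_at)",
            records,
        )
    return conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]


# Illustrative data only; an in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
rows = [
    {"email": "ada@example.com", "name": "Ada Lovelace",
     "extracted_at": "2024-01-01T00:00:00+00:00"},
    {"email": "alan@example.com", "name": "Alan Turing",
     "extracted_at": "2024-01-01T00:00:00+00:00"},
]

first = load(rows, conn)   # initial load
second = load(rows, conn)  # re-running the same load adds no duplicate rows
print(first, second)
```

In a pandas-based pipeline, `records` might come from `DataFrame.to_dict("records")` once timestamps have been converted to ISO strings. A plain append-only insert would instead grow the table on every rerun, which is the failure mode an idempotent load is meant to avoid.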