Choosing the right file format can dramatically affect the performance, cost, and reliability of your data pipelines. This lesson covers CSV, Parquet, Avro, and JSON — their strengths, weaknesses, compression options, schema evolution, and when to use each.
| Feature | CSV | JSON / JSONL | Parquet | Avro |
|---|---|---|---|---|
| Human-readable | Yes | Yes | No | No |
| Schema embedded | No | No | Yes | Yes |
| Columnar storage | No | No | Yes | No (row-based) |
| Compression | External | External | Built-in | Built-in |
| Nested data | No | Yes | Yes | Yes |
| Typical size | Large | Large | Small | Small-Medium |
| Read speed | Slow | Medium | Very fast | Fast |
| Schema evolution | No | No | Limited | Yes |
| Ecosystem support | Universal | Universal | Big data | Big data |
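JSON and JSONL do not get a dedicated code block later in this lesson, so here is a minimal pandas sketch of newline-delimited JSON (JSONL). The paths are illustrative and `df` is any DataFrame, as in the CSV example below.

```python
import pandas as pd

# Write JSON Lines: one JSON object per line, easy to append and to stream
df.to_json("output/data.jsonl", orient="records", lines=True)

# Read it back; pandas infers types, so check df.dtypes afterwards
df = pd.read_json("data/input.jsonl", lines=True)
```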
CSV is ubiquitous but has many pitfalls.
```python
import pandas as pd

# Writing CSV
df.to_csv("output/data.csv", index=False)

# Reading with explicit options
df = pd.read_csv(
    "data/input.csv",
    dtype={"id": int, "name": str, "amount": float},
    parse_dates=["created_at"],
    na_values=["", "N/A", "null", "None"],
    encoding="utf-8",
)
```
| Pitfall | Example | Solution |
|---|---|---|
| No schema | Types guessed incorrectly | Specify dtype explicitly |
| Commas in values | "London, UK" breaks parsing | Use proper quoting |
| Encoding issues | Non-ASCII characters corrupted | Always specify encoding="utf-8" |
| Large file sizes | 10x bigger than Parquet | Compress or switch to Parquet |
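The quoting and file-size pitfalls can both be handled directly from pandas. A minimal sketch, with illustrative paths:

```python
import csv
import pandas as pd

# Quote every field so embedded commas like "London, UK" survive a round trip
df.to_csv("output/data.csv", index=False, quoting=csv.QUOTE_ALL)

# Gzip-compress on write; pandas infers the codec from the .gz suffix on read
df.to_csv("output/data.csv.gz", index=False, compression="gzip")
df = pd.read_csv("output/data.csv.gz")
```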
Parquet is the gold standard for analytical data. It is columnar, compressed, and fast.
```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Write Parquet from pandas
df.to_parquet("output/data.parquet", index=False, compression="snappy")

# Read Parquet — only the columns you need
df = pd.read_parquet("data/data.parquet", columns=["id", "name", "amount"])

# Write with PyArrow directly (more control)
table = pa.Table.from_pandas(df)
pq.write_table(table, "output/data.parquet", compression="snappy")

# Read a large file in batches, column-pruned, without loading it all at once
parquet_file = pq.ParquetFile("data/large_file.parquet")
for batch in parquet_file.iter_batches(batch_size=10000, columns=["id", "amount"]):
    df_chunk = batch.to_pandas()
    process(df_chunk)  # process() is a placeholder for your own transform/load step
```
```text
ROW-BASED (CSV)              COLUMNAR (Parquet)

┌────┬───────┬───────┐       ┌────┬────┬────┬────┐
│ id │ name  │ amt   │       │ id │ id │ id │ id │  ← column 1
├────┼───────┼───────┤       ├────┴────┴────┴────┤
│ 1  │ Alice │ 50.00 │       │name│name│name│name│  ← column 2
│ 2  │ Bob   │ 75.00 │       ├────┴────┴────┴────┤
│ 3  │ Eve   │ 30.00 │       │ amt│ amt│ amt│ amt│  ← column 3
└────┴───────┴───────┘       └────┴────┴────┴────┘

Query: SELECT SUM(amt)       Only reads the 'amt' column!
Reads ALL columns            Much less data from disk.
```
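The same layout also enables predicate pushdown: Parquet stores min/max statistics per row group, so a reader can skip row groups that cannot match a filter. A hedged sketch, assuming the pyarrow engine behind `pd.read_parquet`; the threshold of 40 is arbitrary.

```python
import pandas as pd

# Read only two columns, and only rows where amount > 40.
# Row groups whose min/max statistics rule out a match are skipped entirely.
df = pd.read_parquet(
    "data/data.parquet",
    columns=["id", "amount"],
    filters=[("amount", ">", 40)],
)
```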
Avro is a row-based format with embedded schemas, popular in event streaming (Kafka).
```python
import fastavro

# Define schema
schema = {
    "type": "record",
    "name": "Customer",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "name", "type": "string"},
        {"name": "email", "type": ["null", "string"], "default": None},
    ],
}

# Write Avro
records = [
    {"id": 1, "name": "Alice", "email": "alice@co.com"},
    {"id": 2, "name": "Bob", "email": None},
]
with open("output/data.avro", "wb") as f:
    fastavro.writer(f, schema, records)

# Read Avro
with open("output/data.avro", "rb") as f:
    reader = fastavro.reader(f)
    for record in reader:
        print(record)
```
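Schema evolution in Avro works by resolving the writer schema stored in the file against a reader schema you supply; any field added later needs a default. A minimal sketch that extends the `Customer` record with a hypothetical `country` field:

```python
import fastavro

# Reader schema adds a field; its default fills in values for old records
new_schema = {
    "type": "record",
    "name": "Customer",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "name", "type": "string"},
        {"name": "email", "type": ["null", "string"], "default": None},
        {"name": "country", "type": "string", "default": "unknown"},
    ],
}

with open("output/data.avro", "rb") as f:
    for record in fastavro.reader(f, reader_schema=new_schema):
        print(record["country"])  # "unknown" for records written before the field existed
```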