Choosing the right file format can dramatically affect the performance, cost, and reliability of your data pipelines. This lesson covers CSV, Parquet, Avro, and JSON — their strengths, weaknesses, compression options, schema evolution, and when to use each.
| Feature | CSV | JSON / JSONL | Parquet | Avro |
|---|---|---|---|---|
| Human-readable | Yes | Yes | No | No |
| Schema embedded | No | No | Yes | Yes |
| Columnar storage | No | No | Yes | No (row-based) |
| Compression | External | External | Built-in | Built-in |
| Nested data | No | Yes | Yes | Yes |
| Typical size | Large | Large | Small | Small-Medium |
| Read speed | Slow | Medium | Very fast | Fast |
| Schema evolution | No | No | Limited | Yes |
| Ecosystem support | Universal | Universal | Big data | Big data |
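JSON and JSONL do not get a dedicated code block later in this lesson, so here is a minimal pandas sketch of newline-delimited JSON (JSONL). The paths are illustrative and `df` is any DataFrame, as in the CSV example below.

```python
import pandas as pd

# Write JSON Lines: one JSON object per line, easy to append and to stream
df.to_json("output/data.jsonl", orient="records", lines=True)

# Read it back; pandas infers types, so check df.dtypes afterwards
df = pd.read_json("data/input.jsonl", lines=True)
```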
CSV is ubiquitous but has many pitfalls.
```python
import pandas as pd

# Writing CSV
df.to_csv("output/data.csv", index=False)

# Reading with explicit options
df = pd.read_csv(
    "data/input.csv",
    dtype={"id": int, "name": str, "amount": float},
    parse_dates=["created_at"],
    na_values=["", "N/A", "null", "None"],
    encoding="utf-8",
)
```
| Pitfall | Example | Solution |
|---|---|---|
| No schema | Types guessed incorrectly | Specify dtype explicitly |
| Commas in values | "London, UK" breaks parsing | Use proper quoting |
| Encoding issues | Non-ASCII characters corrupted | Always specify encoding="utf-8" |
| Large file sizes | 10x bigger than Parquet | Compress or switch to Parquet |
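The quoting and file-size pitfalls can both be handled directly from pandas. A minimal sketch, with illustrative paths:

```python
import csv
import pandas as pd

# Quote every field so embedded commas like "London, UK" survive a round trip
df.to_csv("output/data.csv", index=False, quoting=csv.QUOTE_ALL)

# Gzip-compress on write; pandas infers the codec from the .gz suffix on read
df.to_csv("output/data.csv.gz", index=False, compression="gzip")
df = pd.read_csv("output/data.csv.gz")
```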
Parquet is the gold standard for analytical data. It is columnar, compressed, and fast.
```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Write Parquet from pandas
df.to_parquet("output/data.parquet", index=False, compression="snappy")

# Read Parquet — only the columns you need
df = pd.read_parquet("data/data.parquet", columns=["id", "name", "amount"])

# Write with PyArrow directly (more control)
table = pa.Table.from_pandas(df)
pq.write_table(table, "output/data.parquet", compression="snappy")

# Read a large file in batches, column-pruned, without loading it all at once
parquet_file = pq.ParquetFile("data/large_file.parquet")
for batch in parquet_file.iter_batches(batch_size=10000, columns=["id", "amount"]):
    df_chunk = batch.to_pandas()
    process(df_chunk)  # process() is a placeholder for your own transform/load step
```
```text
ROW-BASED (CSV)              COLUMNAR (Parquet)

┌────┬───────┬───────┐       ┌────┬────┬────┬────┐
│ id │ name  │ amt   │       │ id │ id │ id │ id │  ← column 1
├────┼───────┼───────┤       ├────┴────┴────┴────┤
│ 1  │ Alice │ 50.00 │       │name│name│name│name│  ← column 2
│ 2  │ Bob   │ 75.00 │       ├────┴────┴────┴────┤
│ 3  │ Eve   │ 30.00 │       │ amt│ amt│ amt│ amt│  ← column 3
└────┴───────┴───────┘       └────┴────┴────┴────┘

Query: SELECT SUM(amt)       Only reads the 'amt' column!
Reads ALL columns            Much less data from disk.
```
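The same layout also enables predicate pushdown: Parquet stores min/max statistics per row group, so a reader can skip row groups that cannot match a filter. A hedged sketch, assuming the pyarrow engine behind `pd.read_parquet`; the threshold of 40 is arbitrary.

```python
import pandas as pd

# Read only two columns, and only rows where amount > 40.
# Row groups whose min/max statistics rule out a match are skipped entirely.
df = pd.read_parquet(
    "data/data.parquet",
    columns=["id", "amount"],
    filters=[("amount", ">", 40)],
)
```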
Avro is a row-based format with embedded schemas, popular in event streaming (Kafka).
```python
import fastavro

# Define schema
schema = {
    "type": "record",
    "name": "Customer",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "name", "type": "string"},
        {"name": "email", "type": ["null", "string"], "default": None},
    ],
}

# Write Avro
records = [
    {"id": 1, "name": "Alice", "email": "alice@co.com"},
    {"id": 2, "name": "Bob", "email": None},
]
with open("output/data.avro", "wb") as f:
    fastavro.writer(f, schema, records)

# Read Avro
with open("output/data.avro", "rb") as f:
    reader = fastavro.reader(f)
    for record in reader:
        print(record)
```
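Schema evolution in Avro works by resolving the writer schema stored in the file against a reader schema you supply; any field added later needs a default. A minimal sketch that extends the `Customer` record with a hypothetical `country` field:

```python
import fastavro

# Reader schema adds a field; its default fills in values for old records
new_schema = {
    "type": "record",
    "name": "Customer",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "name", "type": "string"},
        {"name": "email", "type": ["null", "string"], "default": None},
        {"name": "country", "type": "string", "default": "unknown"},
    ],
}

with open("output/data.avro", "rb") as f:
    for record in fastavro.reader(f, reader_schema=new_schema):
        print(record["country"])  # "unknown" for records written before the field existed
```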