Choosing the right file format can dramatically affect the performance, cost, and reliability of your data pipelines. This lesson covers CSV, Parquet, Avro, and JSON: their strengths, weaknesses, compression options, schema evolution, and when to use each.
| Feature | CSV | JSON / JSONL | Parquet | Avro |
|---|---|---|---|---|
| Human-readable | Yes | Yes | No | No |
| Schema embedded | No | No | Yes | Yes |
| Columnar storage | No | No | Yes | No (row-based) |
| Compression | External | External | Built-in | Built-in |
| Nested data | No | Yes | Yes | Yes |
| Typical size | Large | Large | Small | Small-Medium |
| Read speed | Slow | Medium | Very fast | Fast |
| Schema evolution | No | No | Limited | Yes |
| Ecosystem support | Universal | Universal | Big data | Big data |
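To make the "Columnar storage" row concrete, here is a minimal sketch in plain Python of why a column-oriented layout suits analytical reads. The sample records and both function names are made up for illustration, and the dictionary of lists is only a conceptual model, not how Parquet is actually encoded on disk.

```python
import json

# Hypothetical sample records, used only for illustration.
records = [
    {"id": 1, "city": "Oslo", "temp_c": 4.5},
    {"id": 2, "city": "Lima", "temp_c": 19.0},
    {"id": 3, "city": "Pune", "temp_c": 31.2},
]

# Row-oriented layout (CSV, JSONL, Avro): one complete record after another.
row_layout = [json.dumps(r) for r in records]

# Column-oriented layout (the idea behind Parquet): values of one field stored together.
column_layout = {key: [r[key] for r in records] for key in records[0]}


def mean_temp_from_rows(lines):
    # Every line is parsed in full, even though only one field is needed.
    temps = [json.loads(line)["temp_c"] for line in lines]
    return sum(temps) / len(temps)


def mean_temp_from_columns(columns):
    # Only the "temp_c" column is touched; "id" and "city" are never read.
    temps = columns["temp_c"]
    return sum(temps) / len(temps)


row_mean = mean_temp_from_rows(row_layout)
column_mean = mean_temp_from_columns(column_layout)
```

Both functions return the same average, but the row-based version has to decode every field of every record to get it. On wide tables with many columns, that difference is a large part of why the table rates Parquet's read speed above the row-based formats for column-heavy queries.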
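CSV is ubiquitous but has many pitfalls: the file carries no schema or type information, quoting and delimiter conventions vary between tools, and the character encoding is not recorded in the file itself.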
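```python
import pandas as pd

# Example DataFrame (illustrative data) so the snippet runs on its own
df = pd.DataFrame({"id": [1, 2], "name": ["Ada", "Grace"]})

# Writing CSV: index=False omits the row index; the output/ directory must already exist
df.to_csv("output/data.csv", index=False)
```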
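Two of the common CSV pitfalls can be reproduced with a short round trip. The sketch below uses Python's standard `csv` module rather than pandas so it runs without extra dependencies. The sample rows are hypothetical and writing goes to an in-memory buffer instead of a file.

```python
import csv
import io

# Hypothetical rows: a ZIP code with a leading zero, a missing value,
# and a free-text field that contains the delimiter.
rows = [
    {"zip": "02134", "qty": 3, "note": "fragile, handle with care"},
    {"zip": "90210", "qty": None, "note": "ok"},
]

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["zip", "qty", "note"])
writer.writeheader()
writer.writerows(rows)

# Read the same text back with a proper CSV parser.
buffer.seek(0)
loaded = list(csv.DictReader(buffer))

# Pitfall 1: types are lost. The integer 3 returns as the string "3",
# and None returns as an empty string.
qty_values = [row["qty"] for row in loaded]

# Pitfall 2: splitting on commas by hand breaks quoted fields.
first_data_line = buffer.getvalue().splitlines()[1]
naive_fields = first_data_line.split(",")
```

After the round trip, `qty_values` holds only strings, so the reader has to reapply types that the file never stored. The naive split yields four pieces for a three-column row because the comma inside the quoted note is treated as a separator. With pandas the type problem shows up differently: `read_csv` infers column types, so a value like `02134` is typically read as an integer and loses its leading zero unless the column is pinned, for example with `dtype={"zip": str}`.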