You are viewing a free preview of this lesson.
Subscribe to unlock all 10 lessons in this course and every other course on LearningBro.
Pandas is the cornerstone library for data manipulation in Python. As a data engineer, you will use it daily to read, filter, transform, aggregate, and write data. This lesson covers DataFrames, reading and writing multiple formats, filtering, grouping, joins, and method chaining.
A DataFrame is a two-dimensional, labelled data structure — think of it as a spreadsheet or SQL table in memory.
import pandas as pd
# Create a DataFrame from a dictionary
df = pd.DataFrame({
"name": ["Alice", "Bob", "Charlie", "Diana"],
"department": ["Engineering", "Marketing", "Engineering", "Marketing"],
"salary": [95000, 72000, 105000, 68000],
"start_date": ["2021-03-15", "2020-07-01", "2019-11-20", "2022-01-10"],
})
print(df)
# name department salary start_date
# 0 Alice Engineering 95000 2021-03-15
# 1 Bob Marketing 72000 2020-07-01
# 2 Charlie Engineering 105000 2019-11-20
# 3 Diana Marketing 68000 2022-01-10
| Attribute | Description | Example Output |
|---|---|---|
df.shape | Rows and columns | (4, 4) |
df.dtypes | Data type of each column | name: object, ... |
df.columns | Column names | Index(['name', ...]) |
df.info() | Summary of the DataFrame | Types, non-null counts |
df.describe() | Statistics for numeric cols | Mean, std, min, max |
Pandas can read from virtually any tabular format.
df = pd.read_csv("data/customers.csv")
# With options
df = pd.read_csv(
"data/customers.csv",
sep=",",
dtype={"id": int, "name": str},
parse_dates=["created_at"],
na_values=["N/A", "null", ""],
)
# Requires pyarrow or fastparquet
df = pd.read_parquet("data/orders.parquet")
# Read specific columns (efficient — only reads those columns from disk)
df = pd.read_parquet("data/orders.parquet", columns=["order_id", "total"])
# Standard JSON array of objects
df = pd.read_json("data/events.json")
# JSON Lines (one JSON object per line)
df = pd.read_json("data/events.jsonl", lines=True)
from sqlalchemy import create_engine
engine = create_engine("postgresql://user:pass@localhost/mydb")
df = pd.read_sql("SELECT * FROM orders WHERE status = 'active'", engine)
# CSV
df.to_csv("output/results.csv", index=False)
# Parquet (columnar, compressed, fast)
df.to_parquet("output/results.parquet", index=False)
# JSON Lines
df.to_json("output/results.jsonl", orient="records", lines=True)
| Format | Size | Read Speed | Write Speed | Schema | Human-Readable |
|---|---|---|---|---|---|
| CSV | Large | Slow | Fast | No | Yes |
| Parquet | Small | Very fast | Fast | Yes | No |
| JSON | Large | Medium | Medium | No | Yes |
| JSONL | Large | Medium | Medium | No | Yes |
# Single condition
high_earners = df[df["salary"] > 80000]
# Multiple conditions (use & for AND, | for OR)
eng_high = df[(df["department"] == "Engineering") & (df["salary"] > 90000)]
# Using .query() for cleaner syntax
eng_high = df.query("department == 'Engineering' and salary > 90000")
# Single column (returns Series)
names = df["name"]
# Multiple columns (returns DataFrame)
subset = df[["name", "salary"]]
Subscribe to continue reading
Get full access to this lesson and all 10 lessons in this course.