Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse service. It enables you to run complex analytical queries against structured and semi-structured data using standard SQL. Redshift is optimised for online analytical processing (OLAP) — aggregating, joining, and summarising billions of rows — rather than the transactional (OLTP) workloads served by RDS or Aurora.
| Characteristic | OLTP (RDS, Aurora, DynamoDB) | OLAP (Redshift) |
|---|---|---|
| Query type | Short, simple (INSERT, SELECT by key) | Long, complex (GROUP BY, JOIN, window functions) |
| Data volume | Gigabytes to low terabytes | Terabytes to petabytes |
| Users | Application servers (thousands of concurrent transactions) | Analysts, BI tools (tens to hundreds of concurrent queries) |
| Schema design | Normalised (3NF) | Denormalised (star / snowflake schema) |
| Optimised for | Low-latency reads and writes | High-throughput scans and aggregations |
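To make the contrast concrete, here is a sketch of the two query shapes (table and column names are hypothetical):

```sql
-- OLTP-style: fetch a single row by primary key (RDS / Aurora territory)
SELECT * FROM orders WHERE order_id = 1001;

-- OLAP-style: scan and aggregate many rows, then rank (Redshift territory)
SELECT region,
       DATE_TRUNC('month', order_date) AS month,
       SUM(amount) AS revenue,
       RANK() OVER (PARTITION BY region
                    ORDER BY SUM(amount) DESC) AS revenue_rank
FROM   orders
GROUP  BY region, DATE_TRUNC('month', order_date);
```

The first query touches one row via an index; the second scans the whole table, which is exactly the access pattern columnar storage accelerates.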
A Redshift cluster consists of a leader node, which parses queries, builds execution plans, and aggregates results, and one or more compute nodes. Each compute node is divided into slices that process their portion of the data in parallel:
┌──────────────────────────────────────┐
│ Leader Node │
│ (query parsing, planning, agg) │
└──────────┬──────────┬────────────────┘
│ │
┌──────┴───┐ ┌────┴─────┐
│ Compute │ │ Compute │
│ Node 1 │ │ Node 2 │
│ Slice A │ │ Slice C │
│ Slice B │ │ Slice D │
└──────────┘ └──────────┘
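You can see the slice-to-node mapping on a running cluster by querying the `STV_SLICES` system view:

```sql
-- Each row is one slice; larger node types carry more slices per node
SELECT node, slice
FROM   stv_slices
ORDER  BY node, slice;
```

On a two-node cluster of a small node type you would see something like nodes 0 and 1 each owning two slices, matching the diagram above.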
Redshift offers three node type families:
| Type | Family | Storage | Use Case |
|---|---|---|---|
| RA3 | ra3.xlplus, ra3.4xlarge, ra3.16xlarge | Managed storage (S3-backed) | Most workloads; scales compute and storage independently |
| DC2 | dc2.large, dc2.8xlarge | Local SSD | Workloads < 1 TB; lowest cost for small datasets |
| DS2 | ds2.xlarge, ds2.8xlarge | Local HDD | Legacy; replaced by RA3 |
Recommendation: Use RA3 nodes for new clusters. They decouple compute from storage, so you can scale and pay for each independently. Hot data is cached on local SSD; cold data lives in S3.
Unlike row-based databases (MySQL, PostgreSQL), Redshift stores data column by column. This architecture is ideal for analytics: a query reads only the columns it references instead of every full row, and because values in a column are all the same type and often similar, they compress far better than mixed row data. Redshift supports several compression encodings:
| Encoding | Best For |
|---|---|
| AZ64 | Numeric and date/time columns (default for these types) |
| LZO | Long strings, VARCHAR |
| ZSTD | General purpose, good compression ratio |
| RAW | No compression; rarely used |
| BYTEDICT | Low-cardinality strings (e.g., country codes) |
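Encodings can be set explicitly per column at table creation. A hedged sketch (the table and its columns are hypothetical, chosen to pair each column type with the encoding suggested above):

```sql
CREATE TABLE page_views (
    view_id    BIGINT        ENCODE az64,      -- numeric: AZ64
    viewed_at  TIMESTAMP     ENCODE az64,      -- date/time: AZ64
    url        VARCHAR(2048) ENCODE lzo,       -- long strings: LZO
    country    CHAR(2)       ENCODE bytedict,  -- low-cardinality: BYTEDICT
    user_agent VARCHAR(512)  ENCODE zstd       -- general purpose: ZSTD
);
```

In practice you rarely need to hand-pick encodings; the automatic path below usually does well.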
When you load an empty table with COPY, Redshift samples the incoming data and applies optimal encodings automatically.
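A minimal COPY sketch (the bucket path and IAM role ARN are placeholders you would replace with your own):

```sql
-- Load CSV data from S3; on an empty table with no explicit encodings,
-- COPY samples the data and chooses compression automatically
COPY sales
FROM 's3://my-bucket/sales/'                                -- hypothetical bucket
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftLoadRole'  -- hypothetical role
FORMAT AS CSV;
```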
When a query joins two large tables, Redshift may have to shuffle rows between nodes so that matching join keys end up on the same slice. The distribution style determines how a table's rows are spread across slices, and therefore how much data movement a join requires.
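One common pattern is to give a fact table and its largest dimension the same distribution key, so their joins are co-located, while replicating small dimensions to every node. A hedged sketch with hypothetical tables:

```sql
-- Fact and large dimension share a DISTKEY: rows with the same
-- customer_id land on the same slice, so joins need no shuffling
CREATE TABLE customers (
    customer_id BIGINT,
    name        VARCHAR(100)
)
DISTSTYLE KEY DISTKEY (customer_id);

CREATE TABLE orders (
    order_id    BIGINT,
    customer_id BIGINT,        -- join key, matches the customers DISTKEY
    amount      DECIMAL(12, 2)
)
DISTSTYLE KEY DISTKEY (customer_id);

-- Small dimension: a full copy on every node makes all joins local
CREATE TABLE regions (
    region_id SMALLINT,
    name      VARCHAR(50)
)
DISTSTYLE ALL;
```

Redshift also offers DISTSTYLE EVEN (round-robin) and DISTSTYLE AUTO (Redshift picks and adjusts the style as the table grows).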