Big Data Concepts

Big data is the term for datasets so large, fast-moving or varied that conventional single-machine tools and relational databases can no longer store or process them within an acceptable time. The AQA specification treats big data not as a vague buzzword but as a precise set of characteristics (the three Vs) with concrete engineering consequences: you must spread storage and processing across many machines, model your data as immutable facts, and choose schemas — relational, graph or otherwise — that fit the questions you need to ask.

Crucially for this course, big data is where functional programming earns its keep in industry. The properties you met earlier — immutability, pure functions and the resulting safe parallelism — are exactly what let a computation be scattered across thousands of machines and recovered cleanly when one fails. This lesson develops the three Vs with examples, explains distributed and parallel processing, works through a MapReduce example step by step, sets out the fact-based immutable data model and graph schemas/databases, and connects big data to machine learning. It is the §4.11 lesson, so the emphasis is on understanding why the engineering is shaped the way it is.

Spec Mapping

This lesson addresses AQA 7517 §4.11 Big Data. You should be able to: explain what big data is and characterise it by the three Vs — volume, velocity and variety; describe distributed and parallel processing across a cluster and why it is needed; understand the functional programming model as the natural fit for big data because immutability and statelessness enable safe parallelism; describe a fact-based, immutable data model in which data is recorded as a growing log of facts rather than mutated in place; explain graph schemas and graph databases and when they are appropriate; and outline the relationship between big data and machine learning. It draws directly on the functional-paradigm characteristics of §4.12.1 (immutability, pure functions) and links to databases (§4.10).

The descriptions here are our own restatement of the assessable content, not a verbatim quotation of AQA's specification.

What Counts as "Big" — the Three Vs

Big data is characterised by three dimensions, the "three Vs". The key insight is that bigness is not just about size: a dataset can demand big-data techniques because of how fast or how varied it is, even if the total volume is modest.

Volume — how much

The sheer amount of data, often terabytes (10¹² bytes), petabytes (10¹⁵) or beyond — far more than any single machine can store or scan in reasonable time.

Social platforms generating billions of posts, likes and messages per day.
Scientific instruments such as particle colliders producing petabytes of readings.
Retailers retaining years of transactions for millions of customers.

Velocity — how fast

The rate at which data arrives and the speed at which it must be processed. Some workloads need real-time or near-real-time handling rather than overnight batch jobs.

Financial trading systems ingesting millions of price updates per second.
Networks of IoT sensors emitting continuous telemetry streams.
Live dashboards and feeds that must reflect events within seconds.

Variety — how many forms

The range of types and formats the data takes.

Data type	Description	Examples
Structured	Rigid schema, fits neatly into tables	Relational databases, spreadsheets
Semi-structured	Some structure but no fixed schema	JSON, XML, CSV
Unstructured	No predefined format	Free text, images, audio, video, email

Beyond the core three, some models add veracity (how trustworthy/clean the data is) and value (the usefulness actually extractable). For AQA the three Vs are the essential ones to know and exemplify; treat veracity and value as useful extensions rather than the headline.

Why Traditional Approaches Fail

Each V breaks a different assumption baked into traditional single-machine, relational processing.

Challenge	Traditional assumption	Why big data breaks it	Which V
Storage	One server holds the data	Data exceeds any single machine's capacity	Volume
Processing	Scan sequentially on one CPU	One machine is far too slow for the volume	Volume
Latency	Overnight batch jobs are fine	Results needed in seconds, not hours	Velocity
Schema	Fixed relational schema known in advance	Data arrives in many, evolving formats	Variety

The escape from all four is the same idea: stop relying on one big machine and instead coordinate many ordinary ones — distributed and parallel processing.

Distributed and Parallel Processing

Distributed processing spreads both the data and the work across many computers (nodes) connected as a cluster; parallel processing means those nodes work on different pieces simultaneously, so the wall-clock time falls roughly in proportion to the number of nodes.

Concept	Meaning
Cluster	A group of connected computers acting as one system
Node	A single computer within the cluster
Data partitioning (sharding)	Splitting the dataset into pieces, one per node
Parallel processing	Many nodes processing their partitions at the same time
Replication	Storing copies of each partition on several nodes
Fault tolerance	The system keeps working when individual nodes fail

Two consequences matter. First, failure is normal, not exceptional: across thousands of commodity machines, some node is always failing, so the system must tolerate it — hence replication (no data is lost when a machine dies) and the ability to re-run a failed task elsewhere. Second, mutable shared state would be catastrophic at this scale: coordinating locks across thousands of machines over a network would be unbearably slow and deadlock-prone. This is the precise point at which functional programming becomes not just elegant but necessary. It is worth stressing how this differs from ordinary multithreading on one computer: there is no shared memory between nodes (each has its own), communication happens over a comparatively slow network, and partial failure of individual machines is routine rather than rare — so the design must assume independence and recoverability from the outset, not bolt them on afterwards.

graph TD
    D["Huge dataset"] --> P["Partition into shards"]
    P --> N1["Node 1 processes shard A"]
    P --> N2["Node 2 processes shard B"]
    P --> N3["Node 3 processes shard C"]
    N1 --> C["Combine partial results"]
    N2 --> C
    N3 --> C
    C --> R["Final answer"]

Why the Functional Model Fits Big Data

The fit is not a coincidence; the two core functional properties map directly onto the two hardest distributed-systems problems.

Immutability ⇒ safe sharing. If data is never modified in place, many nodes can read the same partition with no locks and no risk of one corrupting another's view. There is no race condition because there is no shared writable cell. (This is exactly the argument from the immutability lesson, now at cluster scale.)
Purity ⇒ safe re-execution. A pure, deterministic function gives the same output for the same input every time. So if a node dies mid-task, the scheduler can simply re-run that task on another node and be certain of getting the same result — the foundation of fault tolerance.

Together these give statelessness: each unit of work depends only on its input partition, not on any conversation with other nodes. Stateless, independent tasks are embarrassingly parallel — they scale to thousands of machines because there is nothing to coordinate. That is the property the MapReduce model is built to exploit.

MapReduce — a Worked Example

MapReduce is a programming model for processing large datasets in parallel across a cluster, popularised by Google and named directly after the functional map and reduce (fold) functions. A computation is expressed as two pure functions:

a map function applied independently to each input record, emitting intermediate key–value pairs;
a reduce function that combines all values sharing a key into a single result.

Between them the framework performs a shuffle/group-by-key step, gathering every value for a given key onto one reducer. The programmer writes only the two pure functions; the framework handles partitioning, parallel scheduling, the shuffle, and re-running failed tasks.

Word count, traced

Goal: count how often each word appears across a collection of documents. Suppose three nodes hold one line each:

Node 1: "the cat sat"
Node 2: "the dog sat"
Node 3: "the cat ran"

Map phase — each node runs map on its line independently, emitting (word, 1) for every word:

Node	Input line	Emitted pairs
1	`"the cat sat"`	`("the",1) ("cat",1) ("sat",1)`
2	`"the dog sat"`	`("the",1) ("dog",1) ("sat",1)`
3	`"the cat ran"`	`("the",1) ("cat",1) ("ran",1)`

Shuffle / group-by-key — the framework gathers all values for each key:

"the" -> [1, 1, 1]
"cat" -> [1, 1]
"sat" -> [1, 1]
"dog" -> [1]
"ran" -> [1]

Reduce phase — reduce sums the list for each key (this is a fold with (+)):

Key	Values	Reduced result
`"the"`	`[1,1,1]`	`3`
`"cat"`	`[1,1]`	`2`
`"sat"`	`[1,1]`	`2`
`"dog"`	`[1]`	`1`
`"ran"`	`[1]`	`1`

The whole flow, showing the parallel map fanning out and the reduce gathering in:

graph TD
    L1["Node 1: the cat sat"] --> M1["map -> pairs"]
    L2["Node 2: the dog sat"] --> M2["map -> pairs"]
    L3["Node 3: the cat ran"] --> M3["map -> pairs"]
    M1 --> S["Shuffle: group by key"]
    M2 --> S
    M3 --> S
    S --> R1["reduce 'the' -> 3"]
    S --> R2["reduce 'cat' -> 2"]
    S --> R3["reduce others"]

The map and reduce functions themselves are tiny and pure — here is the local Haskell intuition for the two stages on a single document, which is conceptually what each node computes:

-- the per-record map: each word becomes a (word, 1) pair
wordPairs :: String -> [(String, Int)]
wordPairs line = [ (w, 1) | w <- words line ]

-- the per-key reduce: fold the list of counts with (+)
reduceCounts :: [Int] -> Int
reduceCounts = foldr (+) 0

Why MapReduce works — the functional connection

Map and reduce are pure functions — same input, same output, no side effects.
No shared mutable state — each map task sees only its own partition, so tasks cannot interfere; this is statelessness.
Embarrassingly parallel — because map tasks are independent, the framework can run them on thousands of nodes at once.
Fault tolerant — if a node fails, its task is simply re-run elsewhere; purity and determinism guarantee the re-run gives the same result.

Every one of these is a direct consequence of immutability and purity — the abstract virtues of §4.12.1 cashed out as concrete scalability and reliability.

A second worked example — average sale per region

Word count uses a simple summing reducer; many real tasks need a little more care, and the average is the classic example. Suppose each record is a (region, amount) sale and we want the mean sale per region.

The naive idea — "have each node compute its own average, then average the averages" — is wrong, because averaging averages ignores how many records each contributed. The functional fix is to make the reducer combine sum-and-count pairs rather than finished averages, because sum and count are both associative and so safe to compute partially and merge:

Phase	Operation	Example
Map	each sale → `(region, (amount, 1))`	`("North", (50, 1))`
Shuffle	group pairs by region	`"North" -> [(50,1), (70,1), (30,1)]`
Reduce	sum the amounts and the counts elementwise	`"North" -> (150, 3)`
Finalise	divide total by count	`150 / 3 = 50.0`

-- the per-record map emits a (sum, count) contribution of (amount, 1)
mapSale :: (String, Int) -> (String, (Int, Int))
mapSale (region, amount) = (region, (amount, 1))

-- the reduce merges two (sum, count) pairs — associative, so partials are safe
combine :: (Int, Int) -> (Int, Int) -> (Int, Int)
combine (s1, c1) (s2, c2) = (s1 + s2, c1 + c2)

-- after reducing, the average is one final division
average :: (Int, Int) -> Double
average (total, count) = fromIntegral total / fromIntegral count

Because combine is associative (and (0, 0) is its identity), the framework may apply it in any grouping and even run a partial reduce on each mapper before the shuffle — an optimisation called a combiner that slashes the volume of data sent across the network. This only works because the operation is a pure, associative fold; it is the abstract algebra of fold (§4.12.3) paying off at cluster scale.

Why parallelism gives near-linear speed-up

The payoff of statelessness is quantitative. If a dataset of $N$ records takes time $T$ to process on one node, then partitioning it across $p$ independent nodes that never coordinate ideally takes about:

T_{p} \approx \frac{T}{p} + T_{\text{shuffle}}

The $T/p$ term is the parallel speed-up; the $T_{\text{shuffle}}$ term is the unavoidable cost of moving and grouping intermediate data over the network. The reason FP gets close to the ideal $T/p$ is that there is no synchronisation term — no lock contention, no waiting on shared state — because immutable, stateless tasks have nothing to contend over. A combiner that shrinks the shuffle further reduces $T_{\text{shuffle}}$ , which is why associativity matters so much in practice.

Big Data Concepts

Big Data Concepts

Spec Mapping

What Counts as "Big" — the Three Vs

Volume — how much

Velocity — how fast

Variety — how many forms

Why Traditional Approaches Fail

Distributed and Parallel Processing

Why the Functional Model Fits Big Data

MapReduce — a Worked Example

Word count, traced

Why MapReduce works — the functional connection

A second worked example — average sale per region

Why parallelism gives near-linear speed-up

The Fact-Based, Immutable Data Model

More in Computer Science