You are viewing a free preview of this lesson.
Subscribe to unlock all 10 lessons in this course and every other course on LearningBro.
Big data is the term for datasets so large, fast-moving or varied that conventional single-machine tools and relational databases can no longer store or process them within an acceptable time. The AQA specification treats big data not as a vague buzzword but as a precise set of characteristics (the three Vs) with concrete engineering consequences: you must spread storage and processing across many machines, model your data as immutable facts, and choose schemas — relational, graph or otherwise — that fit the questions you need to ask.
Crucially for this course, big data is where functional programming earns its keep in industry. The properties you met earlier — immutability, pure functions and the resulting safe parallelism — are exactly what let a computation be scattered across thousands of machines and recovered cleanly when one fails. This lesson develops the three Vs with examples, explains distributed and parallel processing, works through a MapReduce example step by step, sets out the fact-based immutable data model and graph schemas/databases, and connects big data to machine learning. It is the §4.11 lesson, so the emphasis is on understanding why the engineering is shaped the way it is.
This lesson addresses AQA 7517 §4.11 Big Data. You should be able to: explain what big data is and characterise it by the three Vs — volume, velocity and variety; describe distributed and parallel processing across a cluster and why it is needed; understand the functional programming model as the natural fit for big data because immutability and statelessness enable safe parallelism; describe a fact-based, immutable data model in which data is recorded as a growing log of facts rather than mutated in place; explain graph schemas and graph databases and when they are appropriate; and outline the relationship between big data and machine learning. It draws directly on the functional-paradigm characteristics of §4.12.1 (immutability, pure functions) and links to databases (§4.10).
The descriptions here are our own restatement of the assessable content, not a verbatim quotation of AQA's specification.
Big data is characterised by three dimensions, the "three Vs". The key insight is that bigness is not just about size: a dataset can demand big-data techniques because of how fast or how varied it is, even if the total volume is modest.
The sheer amount of data, often terabytes (10¹² bytes), petabytes (10¹⁵) or beyond — far more than any single machine can store or scan in reasonable time.
The rate at which data arrives and the speed at which it must be processed. Some workloads need real-time or near-real-time handling rather than overnight batch jobs.
The range of types and formats the data takes.
| Data type | Description | Examples |
|---|---|---|
| Structured | Rigid schema, fits neatly into tables | Relational databases, spreadsheets |
| Semi-structured | Some structure but no fixed schema | JSON, XML, CSV |
| Unstructured | No predefined format | Free text, images, audio, video, email |
Beyond the core three, some models add veracity (how trustworthy/clean the data is) and value (the usefulness actually extractable). For AQA the three Vs are the essential ones to know and exemplify; treat veracity and value as useful extensions rather than the headline.
Each V breaks a different assumption baked into traditional single-machine, relational processing.
| Challenge | Traditional assumption | Why big data breaks it | Which V |
|---|---|---|---|
| Storage | One server holds the data | Data exceeds any single machine's capacity | Volume |
| Processing | Scan sequentially on one CPU | One machine is far too slow for the volume | Volume |
| Latency | Overnight batch jobs are fine | Results needed in seconds, not hours | Velocity |
| Schema | Fixed relational schema known in advance | Data arrives in many, evolving formats | Variety |
The escape from all four is the same idea: stop relying on one big machine and instead coordinate many ordinary ones — distributed and parallel processing.
Distributed processing spreads both the data and the work across many computers (nodes) connected as a cluster; parallel processing means those nodes work on different pieces simultaneously, so the wall-clock time falls roughly in proportion to the number of nodes.
| Concept | Meaning |
|---|---|
| Cluster | A group of connected computers acting as one system |
| Node | A single computer within the cluster |
| Data partitioning (sharding) | Splitting the dataset into pieces, one per node |
| Parallel processing | Many nodes processing their partitions at the same time |
| Replication | Storing copies of each partition on several nodes |
| Fault tolerance | The system keeps working when individual nodes fail |
Two consequences matter. First, failure is normal, not exceptional: across thousands of commodity machines, some node is always failing, so the system must tolerate it — hence replication (no data is lost when a machine dies) and the ability to re-run a failed task elsewhere. Second, mutable shared state would be catastrophic at this scale: coordinating locks across thousands of machines over a network would be unbearably slow and deadlock-prone. This is the precise point at which functional programming becomes not just elegant but necessary. It is worth stressing how this differs from ordinary multithreading on one computer: there is no shared memory between nodes (each has its own), communication happens over a comparatively slow network, and partial failure of individual machines is routine rather than rare — so the design must assume independence and recoverability from the outset, not bolt them on afterwards.
graph TD
D["Huge dataset"] --> P["Partition into shards"]
P --> N1["Node 1 processes shard A"]
P --> N2["Node 2 processes shard B"]
P --> N3["Node 3 processes shard C"]
N1 --> C["Combine partial results"]
N2 --> C
N3 --> C
C --> R["Final answer"]
The fit is not a coincidence; the two core functional properties map directly onto the two hardest distributed-systems problems.
Together these give statelessness: each unit of work depends only on its input partition, not on any conversation with other nodes. Stateless, independent tasks are embarrassingly parallel — they scale to thousands of machines because there is nothing to coordinate. That is the property the MapReduce model is built to exploit.
MapReduce is a programming model for processing large datasets in parallel across a cluster, popularised by Google and named directly after the functional map and reduce (fold) functions. A computation is expressed as two pure functions:
Between them the framework performs a shuffle/group-by-key step, gathering every value for a given key onto one reducer. The programmer writes only the two pure functions; the framework handles partitioning, parallel scheduling, the shuffle, and re-running failed tasks.
Goal: count how often each word appears across a collection of documents. Suppose three nodes hold one line each:
Node 1: "the cat sat"
Node 2: "the dog sat"
Node 3: "the cat ran"
Map phase — each node runs map on its line independently, emitting (word, 1) for every word:
| Node | Input line | Emitted pairs |
|---|---|---|
| 1 | "the cat sat" | ("the",1) ("cat",1) ("sat",1) |
| 2 | "the dog sat" | ("the",1) ("dog",1) ("sat",1) |
| 3 | "the cat ran" | ("the",1) ("cat",1) ("ran",1) |
Shuffle / group-by-key — the framework gathers all values for each key:
"the" -> [1, 1, 1]
"cat" -> [1, 1]
"sat" -> [1, 1]
"dog" -> [1]
"ran" -> [1]
Reduce phase — reduce sums the list for each key (this is a fold with (+)):
| Key | Values | Reduced result |
|---|---|---|
"the" | [1,1,1] | 3 |
"cat" | [1,1] | 2 |
"sat" | [1,1] | 2 |
"dog" | [1] | 1 |
"ran" | [1] | 1 |
The whole flow, showing the parallel map fanning out and the reduce gathering in:
graph TD
L1["Node 1: the cat sat"] --> M1["map -> pairs"]
L2["Node 2: the dog sat"] --> M2["map -> pairs"]
L3["Node 3: the cat ran"] --> M3["map -> pairs"]
M1 --> S["Shuffle: group by key"]
M2 --> S
M3 --> S
S --> R1["reduce 'the' -> 3"]
S --> R2["reduce 'cat' -> 2"]
S --> R3["reduce others"]
The map and reduce functions themselves are tiny and pure — here is the local Haskell intuition for the two stages on a single document, which is conceptually what each node computes:
-- the per-record map: each word becomes a (word, 1) pair
wordPairs :: String -> [(String, Int)]
wordPairs line = [ (w, 1) | w <- words line ]
-- the per-key reduce: fold the list of counts with (+)
reduceCounts :: [Int] -> Int
reduceCounts = foldr (+) 0
Every one of these is a direct consequence of immutability and purity — the abstract virtues of §4.12.1 cashed out as concrete scalability and reliability.
Word count uses a simple summing reducer; many real tasks need a little more care, and the average is the classic example. Suppose each record is a (region, amount) sale and we want the mean sale per region.
The naive idea — "have each node compute its own average, then average the averages" — is wrong, because averaging averages ignores how many records each contributed. The functional fix is to make the reducer combine sum-and-count pairs rather than finished averages, because sum and count are both associative and so safe to compute partially and merge:
| Phase | Operation | Example |
|---|---|---|
| Map | each sale → (region, (amount, 1)) | ("North", (50, 1)) |
| Shuffle | group pairs by region | "North" -> [(50,1), (70,1), (30,1)] |
| Reduce | sum the amounts and the counts elementwise | "North" -> (150, 3) |
| Finalise | divide total by count | 150 / 3 = 50.0 |
-- the per-record map emits a (sum, count) contribution of (amount, 1)
mapSale :: (String, Int) -> (String, (Int, Int))
mapSale (region, amount) = (region, (amount, 1))
-- the reduce merges two (sum, count) pairs — associative, so partials are safe
combine :: (Int, Int) -> (Int, Int) -> (Int, Int)
combine (s1, c1) (s2, c2) = (s1 + s2, c1 + c2)
-- after reducing, the average is one final division
average :: (Int, Int) -> Double
average (total, count) = fromIntegral total / fromIntegral count
Because combine is associative (and (0, 0) is its identity), the framework may apply it in any grouping and even run a partial reduce on each mapper before the shuffle — an optimisation called a combiner that slashes the volume of data sent across the network. This only works because the operation is a pure, associative fold; it is the abstract algebra of fold (§4.12.3) paying off at cluster scale.
The payoff of statelessness is quantitative. If a dataset of N records takes time T to process on one node, then partitioning it across p independent nodes that never coordinate ideally takes about:
Tp≈pT+TshuffleThe T/p term is the parallel speed-up; the Tshuffle term is the unavoidable cost of moving and grouping intermediate data over the network. The reason FP gets close to the ideal T/p is that there is no synchronisation term — no lock contention, no waiting on shared state — because immutable, stateless tasks have nothing to contend over. A combiner that shrinks the shuffle further reduces Tshuffle, which is why associativity matters so much in practice.
Subscribe to continue reading
Get full access to this lesson and all 10 lessons in this course.