Some datasets have grown so large, so fast-moving and so messy that the tidy relational database of the previous lessons (neat tables, fixed schema, one machine running ACID transactions) simply cannot cope. Big data is the name for that regime, and for the different tools and ways of thinking it demands. This lesson characterises big data through its defining properties (the "Vs": volume, velocity and variety, often extended with veracity and value), separates structured from unstructured data, and explains how big data is stored and processed across clusters of many machines rather than one. The technical heart of the topic, and the part most often missed, is why big data is processed using a functional programming style: pure, side-effect-free functions are exactly what make a computation safe to split across hundreds of processors at once. We finish with the challenges (cost, privacy, quality, bias) and the opportunities (insight, prediction, science) that big data brings.
This lesson addresses the H446 1.3 content on big data and the role of functional programming in its processing:
(Phrasing here paraphrases the specification content; it is not a verbatim quote.)
Big data refers to datasets so large, so rapidly changing, or so varied in form that traditional data-processing tools and a single machine are inadequate to capture, store, manage and analyse them within a useful time. The phrase is relative, not a fixed threshold: "big" means big enough that the ordinary approach breaks down and a qualitatively different approach — distributed storage, parallel processing, specialised databases — becomes necessary. A spreadsheet of a few thousand sales is not big data however carefully you keep it; the continuous global stream of every search query, sensor reading, transaction and social-media post most certainly is.
It helps to see big data as a consequence of three trends running together: the falling cost of storage (keeping everything became cheaper than deciding what to throw away), the spread of always-connected devices generating data continuously, and the development of frameworks that let ordinary computers be ganged together to process it. None alone creates big data; together they made it both possible and unavoidable.
The standard way to characterise big data is through a set of properties conventionally beginning with the letter V. The original three are volume, velocity and variety; two more — veracity and value — are very commonly added and are worth knowing.
| Characteristic | What it means | Illustration (generic) |
|---|---|---|
| Volume | The sheer amount of data, frequently terabytes, petabytes or beyond: far more than a single disk or server could hold or scan. | A large social platform may accumulate petabytes of new data; no one machine could store or search it. |
| Velocity | The speed at which new data arrives and the speed at which it must be processed, often in real time or near-real-time. | Sensor or transaction data streaming in continuously that must be analysed as it arrives, not in an overnight batch. |
| Variety | The many different formats and types of data, structured and unstructured, that must be handled together. | Text, images, audio, video, GPS coordinates, log files and clickstreams all in one dataset. |
| Veracity | The trustworthiness and accuracy of the data: it may be noisy, incomplete, inconsistent or simply wrong, and conclusions are only as good as the data behind them. | User-entered fields with typos, sensors that occasionally misreport, duplicated or contradictory records. |
| Value | The usefulness of the data — the insight or benefit that can actually be extracted from it. Data has no worth in itself; the worth is in what analysing it reveals. | The same raw logs are worthless until analysis turns them into a decision (which product to stock, which fault to fix). |
A neat way to remember why each V matters is to pair it with the difficulty it creates: volume challenges storage, velocity challenges processing speed, variety challenges the data model, veracity challenges trust in the result, and value is the justification for tackling the other four at all. Examiners reward answers that explain each V and the problem it poses, not a bare list of the words.
Exam Tip: Learn at least the three core Vs (volume, velocity, variety) with a one-line meaning and an example each; mention veracity and value to show breadth. In a scenario question, apply the Vs to the scenario ("the velocity here is high because readings arrive every second…") rather than reciting generic definitions — the marks are for the application.
A central reason big data needs new tools is that most of it does not fit the neat, fixed-column shape a relational database expects.
| Feature | Structured data | Unstructured data |
|---|---|---|
| Schema | A predefined schema: fixed fields, each with a defined data type. | No predefined schema; the content has internal structure to a human but no fixed fields. |
| Storage | Relational databases (the SQL tables of earlier lessons). | Data lakes, NoSQL stores, distributed file systems. |
| Examples | Customer records, exam results, financial transactions. | Free-text documents, emails, images, audio, video, social-media posts. |
| Querying | Straightforward with SQL and indexes. | Needs specialised processing (text analysis, image/audio recognition) before it can be queried meaningfully. |
| Share of all data | A minority of the data organisations now hold. | The large majority of it. |
Between the two sits semi-structured data: it has no rigid table schema but does carry tags or markers that describe its structure, so a program can parse it. The classic examples are JSON and XML, where elements are labelled ({ "name": "Ada", "age": 36 }) but new fields can appear freely, and log files, which follow a loose line-by-line pattern. Semi-structured formats are the lingua franca of big data because they are flexible enough to absorb variety yet machine-readable enough to process at scale.
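To make the idea of semi-structured data concrete, here is a minimal sketch in Python. The records and field names are invented for illustration; the point is only that each record labels its own fields, so a program can parse it even though no two records need share the same shape.

```python
import json

# Hypothetical records: one JSON object per line, and the fields vary.
raw_lines = [
    '{"name": "Ada", "age": 36}',
    '{"name": "Alan", "age": 41, "city": "Manchester"}',
    '{"name": "Grace", "languages": ["COBOL", "FORTRAN"]}',
]

def parse_record(line):
    """Turn one JSON line into a dictionary; the labels come from the data itself."""
    return json.loads(line)

records = [parse_record(line) for line in raw_lines]

# The set of fields is discovered from the data rather than fixed in advance.
all_fields = sorted({field for record in records for field in record})

# A missing field yields a default (None) instead of breaking a fixed schema.
ages = [record.get("age") for record in records]

print(all_fields)  # ['age', 'city', 'languages', 'name']
print(ages)        # [36, 41, None]
```

A relational table would need its columns altered before the second and third records could be stored; here they are simply absorbed.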
The key exam point is why this matters: a relational database demands you fix the schema before you store anything, which is impossible when the data is heterogeneous and arrives faster than any schema could be designed. Big-data systems therefore often store data raw first and impose structure later, at read time ("schema-on-read"), the opposite of the relational "schema-on-write" you met with normalisation.
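Schema-on-read can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how any particular big-data product works: the log format, the `PATTERN` expression and the `read_with_schema` helper are all hypothetical. The raw lines are stored untouched, and structure is applied only when they are read.

```python
import re

# Raw storage: lines are kept exactly as they arrived, with no schema enforced.
raw_store = [
    "2024-01-05 09:00:01 sensor=7 temp=21.5",
    "2024-01-05 09:00:02 sensor=9 temp=19.0 humidity=40",
    "corrupted line with no recognisable fields",
]

# The "schema" lives in the reader, not in the store (hypothetical format).
PATTERN = re.compile(r"sensor=(?P<sensor>\d+) temp=(?P<temp>[\d.]+)")

def read_with_schema(line):
    """Apply structure at read time; return None if the line does not fit."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return {"sensor": int(match.group("sensor")), "temp": float(match.group("temp"))}

# Lines that do not fit the reader's schema are skipped, not rejected at write time.
rows = [row for row in map(read_with_schema, raw_store) if row is not None]
print(rows)
```

Because the store never rejected anything, a different reader could later extract the humidity field from the same raw lines without any data having been lost.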
No single computer can store petabytes or scan them quickly enough, so big data is held and processed on a distributed system: a cluster of many ordinary ("commodity") machines, called nodes, working together and coordinated by software so that, to the user, they behave as one large system.
flowchart TD
SRC[Huge dataset<br/>too big for one machine] --> SPLIT[Split into many blocks]
SPLIT --> N1[Node 1<br/>stores + processes its blocks]
SPLIT --> N2[Node 2<br/>stores + processes its blocks]
SPLIT --> N3[Node 3<br/>stores + processes its blocks]
SPLIT --> N4[Node 4<br/>stores + processes its blocks]
N1 --> AGG[Combine partial results]
N2 --> AGG
N3 --> AGG
N4 --> AGG
AGG --> OUT[Final result]
Two ideas make this work:
This "move the computation to the data, process every block in parallel, then combine the answers" pattern is the engine of big data. But it only works if the per-block computations are genuinely independent — which is exactly the guarantee the next section is about.
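The split, process and combine pattern can be imitated on a single machine. The sketch below is only an analogy: worker threads from Python's standard `concurrent.futures` module stand in for the nodes of a cluster, and the function names and block size are assumptions chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_blocks(data, block_size):
    """Cut the dataset into fixed-size blocks, one unit of work each."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def process_block(block):
    """Per-block work: depends only on the block it is given, nothing shared."""
    return sum(block)

def run_on_cluster(data, block_size=250, workers=4):
    blocks = split_into_blocks(data, block_size)
    # Each worker thread plays the part of one node handling its own blocks.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_results = list(pool.map(process_block, blocks))
    # Combine the partial results into the final answer.
    return sum(partial_results)

readings = list(range(1, 1001))
total = run_on_cluster(readings)
print(total)  # 500500
```

The answer is the same however the blocks are shared out, precisely because `process_block` looks at nothing except its own block.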
This is the part of the topic the specification singles out, and the part candidates most often get wrong. The question is: why is a functional style so well suited to processing big data in parallel?
The answer turns on side effects and shared state. The danger in any parallel computation is that two tasks running at once both read and write the same piece of memory — the race condition and lost-update problems you met with concurrent database transactions. Those problems exist because tasks share mutable state. The locking that fixes them adds waiting, contention and the risk of deadlock, and it does not scale well to thousands of nodes.
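A lost update is easy to reproduce if the interleaving is written out by hand. The sketch below does not use real threads; it simply performs the steps of two hypothetical tasks, A and B, in the unlucky order that a real parallel run could produce, so the outcome is the same every time it is run.

```python
# Shared mutable state: one counter that both tasks read and write.
shared = {"count": 0}

# Task A and task B each intend to add 1, but their steps interleave.
a_read = shared["count"]      # A reads 0
b_read = shared["count"]      # B reads 0 before A has written
shared["count"] = a_read + 1  # A writes 1
shared["count"] = b_read + 1  # B also writes 1, overwriting A's update

lost_update_result = shared["count"]
print(lost_update_result)  # 1, although two increments were requested

# Without shared state: each task returns its own contribution instead,
# and the contributions are combined afterwards.
def contribution():
    return 1

safe_result = sum([contribution(), contribution()])
print(safe_result)  # 2
```

The second version needs no lock, because neither task ever writes to anything the other can see.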
Functional programming sidesteps the whole problem by removing the shared mutable state in the first place:
| Functional property | What it means | Why it enables safe parallelism |
|---|---|---|
| Pure functions | A function's output depends only on its inputs, and it has no side effects (it changes nothing outside itself). | Calling the same function on the same data always gives the same answer, whenever and wherever it runs — so the work can be moved to any node and run in any order with no surprises. |
| No shared mutable state | Functions do not read or write global variables that other tasks also touch. | With nothing shared to corrupt, there are no race conditions and no need for locks — the central obstacle to parallelism is gone. |
| Immutable data | Data is never modified in place; transforming it produces a new value, leaving the original untouched. | Many tasks can read the same input simultaneously and safely, because none can change it underneath the others. |
| Higher-order functions (map, filter, reduce/fold) | Functions that take other functions as arguments and apply them across a whole collection. | They express a computation as "apply this independent function to every element" — a description that is trivially parallelisable: the elements can be shared out across nodes. |
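
The properties in the table can be seen in a short Python sketch. Python is not a purely functional language, so this only imitates the style; the function names and the sample values are invented for illustration.

```python
from functools import reduce

total_seen = 0  # global state, used only by the impure example

def impure_add(x):
    """Impure: reads and writes a global, so the result depends on call history."""
    global total_seen
    total_seen += x
    return total_seen

def square(x):
    """Pure: the output depends only on the input."""
    return x * x

def is_even(x):
    return x % 2 == 0

readings = (1, 2, 3, 4, 5, 6)  # a tuple, so it cannot be modified in place

# Higher-order functions: each takes another function as an argument.
squares = tuple(map(square, readings))
even_squares = tuple(filter(is_even, squares))
total = reduce(lambda acc, x: acc + x, even_squares, 0)

print(squares)       # (1, 4, 9, 16, 25, 36)
print(even_squares)  # (4, 16, 36)
print(total)         # 56

# Same argument, different answers: unsafe to re-run or reorder.
first, second = impure_add(2), impure_add(2)
print(first, second)  # 2 4
```

Every call to `square` could be handed to a different processor, in any order, and the result would not change. The same cannot be said of `impure_add`.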
The famous concrete realisation is the MapReduce model. A computation is written as two pure functions:
flowchart LR
IN[Input split into blocks] --> M1[map block 1]
IN --> M2[map block 2]
IN --> M3[map block 3]
M1 --> SH[Group by key]
M2 --> SH
M3 --> SH
SH --> R1[reduce key A]
SH --> R2[reduce key B]
R1 --> OUT[Combined result]
R2 --> OUT
Because map is pure and side-effect-free, the framework is free to run it on every block at once, retry it on a different node if a machine dies (re-running a pure function is always safe — it cannot have half-changed anything), and combine the results in any convenient order. That freedom is the whole point. The functional guarantees are precisely what let the framework distribute, parallelise and recover automatically, hiding all of it from the programmer, who need only supply two ordinary functions. An imperative solution full of shared counters and in-place updates would offer none of these guarantees and could not be parallelised so freely or so safely.
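Word counting is the usual first example of this model. The sketch below runs on one machine and is not the API of any real framework: `map_block`, `group_by_key` and `reduce_key` are hypothetical names for the map, grouping and reduce stages, and the input blocks are invented.

```python
from collections import defaultdict
from functools import reduce

def map_block(block):
    """Map: turn one block of text into (word, 1) pairs. Pure."""
    return [(word.lower(), 1) for word in block.split()]

def group_by_key(pairs):
    """Group: collect all values that share a key.
    The dictionary is local to this call, so nothing outside is changed."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

def reduce_key(values):
    """Reduce: combine the values for one key. Pure."""
    return reduce(lambda a, b: a + b, values, 0)

def word_count(blocks):
    # On a cluster each map_block call could run on a different node.
    mapped = [pair for block in blocks for pair in map_block(block)]
    grouped = group_by_key(mapped)
    return {key: reduce_key(values) for key, values in grouped.items()}

blocks = ["big data is big", "data is everywhere", "big clusters"]
counts = word_count(blocks)
print(counts)
# {'big': 3, 'data': 2, 'is': 2, 'everywhere': 1, 'clusters': 1}
```

Processing the blocks in a different order, or calling `map_block` twice on the same block after a simulated failure, gives the same counts, which is the property the framework relies on.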
Exam Tip: "Why is functional programming used to process big data?" is a stock question. The winning answer is: functions are pure / have no side effects and data is immutable, so there is no shared mutable state, therefore no race conditions and no locks, therefore the work can be split across many processors and run in any order safely (and re-run on failure). Tie it explicitly to map and reduce.
Big data is neither purely a magic wand nor purely a menace. Like most powerful technology it is something of both, and a good answer weighs the two.
| Challenge | Explanation |
|---|---|
| Storage and cost | Holding and replicating petabytes needs large clusters, electricity and skilled staff; the infrastructure is expensive to build and run. |
| Velocity / processing | Analysing fast-arriving streams in time to act on them is technically demanding and pushes hardware hard. |
| Veracity (data quality) | Conclusions are only as trustworthy as the data: noise, gaps, duplication and error can produce confident but wrong results ("garbage in, garbage out"). |
| Privacy and consent | Combining many datasets can re-identify individuals who appeared anonymous in each one, and people rarely understand how far their data travels (a direct link to the Data Protection Act lesson). |
| Bias | Patterns learned from historical data can entrench historical unfairness, so an analysis can be statistically valid yet ethically harmful. |
| Security | A single vast store of sensitive data is a high-value target; a breach is correspondingly serious. |
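
The privacy row is easier to appreciate with a toy example. Everything below is invented for illustration: two small datasets, neither of which links a name to a diagnosis, and a hypothetical `link` function that joins them on the fields they share.

```python
# Invented illustrative data: neither dataset holds both a name and a diagnosis.
health_records = [
    {"postcode": "AB1 2CD", "birth_year": 1984, "diagnosis": "asthma"},
    {"postcode": "EF3 4GH", "birth_year": 1990, "diagnosis": "diabetes"},
]
public_register = [
    {"name": "Person A", "postcode": "AB1 2CD", "birth_year": 1984},
    {"name": "Person B", "postcode": "ZZ9 9ZZ", "birth_year": 1975},
]

def quasi_id(record):
    """Fields that are harmless alone but identifying in combination."""
    return (record["postcode"], record["birth_year"])

def link(anonymous, identified):
    """Join the two datasets wherever the quasi-identifiers match."""
    names_by_id = {quasi_id(r): r["name"] for r in identified}
    return [
        {"name": names_by_id[quasi_id(r)], "diagnosis": r["diagnosis"]}
        for r in anonymous
        if quasi_id(r) in names_by_id
    ]

linked = link(health_records, public_register)
print(linked)  # [{'name': 'Person A', 'diagnosis': 'asthma'}]
```

Removing the name column from the first dataset did not make it anonymous; it only made re-identification depend on someone holding a second dataset.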