Big Data

Some datasets have grown so large, so fast-moving and so messy that the tidy relational database of the previous lessons — neat tables, fixed schema, one machine running ACID transactions — simply cannot cope. Big data is the name for that regime, and for the different tools and ways of thinking it demands. This lesson defines big data through its defining characteristics (the "Vs": volume, velocity and variety, often extended with veracity and value), separates structured from unstructured data, and explains how big data is stored and processed across clusters of many machines rather than one. The technical heart of the topic — and the part most often missed — is why big data is processed using a functional programming style: pure, side-effect-free functions are exactly what make a computation safe to split across hundreds of processors at once. We finish with the challenges (cost, privacy, quality, bias) and the opportunities (insight, prediction, science) that big data brings.

Spec Mapping

This lesson addresses the H446 1.3 content on big data and the role of functional programming in its processing:

Define big data and describe its characteristics (volume, velocity, variety; with veracity and value as common extensions), giving an example of each.
Distinguish structured data (a fixed schema, suited to a relational database) from unstructured and semi-structured data, and explain why most big data is unstructured.
Explain that big data is stored and processed across distributed systems (clusters of commodity machines) rather than a single server, and why this is necessary.
Explain the role of functional programming — pure functions with no side effects, immutable data, and higher-order functions such as map and reduce — in enabling safe parallel processing of big data.
Discuss the challenges and opportunities of big data, including data quality (veracity), privacy, bias and the value of the insight extracted.

(Phrasing here paraphrases the specification content; it is not a verbatim quote.)

What Is Big Data?

Big data refers to datasets so large, so rapidly changing, or so varied in form that traditional data-processing tools and a single machine are inadequate to capture, store, manage and analyse them within a useful time. The phrase is relative, not a fixed threshold: "big" means big enough that the ordinary approach breaks down and a qualitatively different approach — distributed storage, parallel processing, specialised databases — becomes necessary. A spreadsheet of a few thousand sales is not big data however carefully you keep it; the continuous global stream of every search query, sensor reading, transaction and social-media post most certainly is.

It helps to see big data as a consequence of three trends running together: the falling cost of storage (keeping everything became cheaper than deciding what to throw away), the spread of always-connected devices generating data continuously, and the development of frameworks that let ordinary computers be ganged together to process it. None alone creates big data; together they made it both possible and unavoidable.

The Characteristics of Big Data — the "Vs"

The standard way to characterise big data is through a set of properties conventionally beginning with the letter V. The original three are volume, velocity and variety; two more — veracity and value — are very commonly added and are worth knowing.

Characteristic	What it means	Illustration (generic)
Volume	The sheer amount of data, frequently terabytes, petabytes or beyond: far more than a single disk or server could hold or scan.	A large social platform can accumulate petabytes of new data, more than any one machine could store or search.
Velocity	The speed at which new data arrives and the speed at which it must be processed, often in real time or near-real-time.	Sensor or transaction data streaming in continuously that must be analysed as it arrives, not in an overnight batch.
Variety	The many different formats and types of data, structured and unstructured, that must be handled together.	Text, images, audio, video, GPS coordinates, log files and clickstreams all in one dataset.
Veracity	The trustworthiness and accuracy of the data: it may be noisy, incomplete, inconsistent or simply wrong, and conclusions are only as good as the data behind them.	User-entered fields with typos, sensors that occasionally misreport, duplicated or contradictory records.
Value	The usefulness of the data — the insight or benefit that can actually be extracted from it. Data has no worth in itself; the worth is in what analysing it reveals.	The same raw logs are worthless until analysis turns them into a decision (which product to stock, which fault to fix).

A neat way to remember why each V matters is to pair it with the difficulty it creates: volume challenges storage, velocity challenges processing speed, variety challenges the data model, veracity challenges trust in the result, and value is the justification for tackling the other four at all. Examiners reward answers that explain each V and the problem it poses, not a bare list of the words.

Exam Tip: Learn at least the three core Vs (volume, velocity, variety) with a one-line meaning and an example each; mention veracity and value to show breadth. In a scenario question, apply the Vs to the scenario ("the velocity here is high because readings arrive every second…") rather than reciting generic definitions — the marks are for the application.

Structured, Unstructured and Semi-Structured Data

A central reason big data needs new tools is that most of it does not fit the neat, fixed-column shape a relational database expects.

Feature	Structured data	Unstructured data
Schema	A predefined schema: fixed fields, each with a defined data type.	No predefined schema; the content may be meaningful to a human reader but has no fixed fields for a program to rely on.
Storage	Relational databases (the SQL tables of earlier lessons).	Data lakes, NoSQL stores, distributed file systems.
Examples	Customer records, exam results, financial transactions.	Free-text documents, emails, images, audio, video, social-media posts.
Querying	Straightforward with SQL and indexes.	Needs specialised processing (text analysis, image/audio recognition) before it can be queried meaningfully.
Share of all data	A minority of the data organisations now hold.	The large majority of it.

Semi-structured data

Between the two sits semi-structured data: it has no rigid table schema but does carry tags or markers that describe its structure, so a program can parse it. The classic examples are JSON and XML, where elements are labelled ({ "name": "Ada", "age": 36 }) but new fields can appear freely, and log files, which follow a loose line-by-line pattern. Semi-structured formats are the lingua franca of big data because they are flexible enough to absorb variety yet machine-readable enough to process at scale.
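As a small illustration of the JSON case, the sketch below parses two records with Python's standard json module. The records and the names raw_lines, fields_seen and emails are invented for the example; the point is only that the labels travel with the data, so a new field can appear without breaking the program.

```python
import json

# Hypothetical records: each line is a self-describing JSON object.
# The second record carries a field ("email") that the first does not have.
raw_lines = [
    '{"name": "Ada", "age": 36}',
    '{"name": "Alan", "age": 41, "email": "alan@example.org"}',
]

records = [json.loads(line) for line in raw_lines]

# The labels are part of the data, so a program can discover which fields
# are present instead of relying on a fixed table schema.
fields_seen = sorted({key for record in records for key in record})

# .get() supplies a default where a record lacks the field.
emails = [record.get("email", "unknown") for record in records]

print(fields_seen)  # ['age', 'email', 'name']
print(emails)       # ['unknown', 'alan@example.org']
```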

The key exam point is why this matters: a relational database demands you fix the schema before you store anything, which is impossible when the data is heterogeneous and arrives faster than any schema could be designed. Big-data systems therefore often store data raw first and impose structure later, at read time ("schema-on-read"), the opposite of the relational "schema-on-write" you met with normalisation.
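A minimal sketch of schema-on-read follows, assuming an in-memory list stands in for the raw store. The functions ingest and read_with_schema and the sensor records are hypothetical, chosen only to show that nothing is validated when data is written and that structure is imposed by whoever reads it.

```python
import json

# The "store" keeps raw text exactly as it arrived.
raw_store = []

def ingest(line):
    """Store the raw line untouched; no schema is checked at write time."""
    raw_store.append(line)

def read_with_schema(schema):
    """Impose a schema only when reading.

    `schema` maps each wanted field name to a conversion function.
    A record that lacks a field, or holds a value that cannot be
    converted, gets None for that field.
    """
    rows = []
    for line in raw_store:
        record = json.loads(line)
        row = {}
        for field, convert in schema.items():
            try:
                row[field] = convert(record[field])
            except (KeyError, ValueError, TypeError):
                row[field] = None
        rows.append(row)
    return rows

# Three records with different shapes are all accepted as they are.
ingest('{"sensor": "t1", "reading": "21.5"}')
ingest('{"sensor": "t2", "reading": 19, "unit": "C"}')
ingest('{"sensor": "t3"}')

# The reader decides which fields matter and what type each should have.
rows = read_with_schema({"sensor": str, "reading": float})
print(rows[2])  # {'sensor': 't3', 'reading': None}
```

A different reader could apply a different schema to the same raw store, which is exactly what a schema-on-write database does not allow.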

Distributed Storage and Processing

No single computer can store petabytes or scan them quickly enough, so big data is held and processed on a distributed system: a cluster of many ordinary ("commodity") machines, called nodes, working together and coordinated by software so that, to the user, they behave as one large system.

flowchart TD
    SRC[Huge dataset<br/>too big for one machine] --> SPLIT[Split into many blocks]
    SPLIT --> N1[Node 1<br/>stores + processes its blocks]
    SPLIT --> N2[Node 2<br/>stores + processes its blocks]
    SPLIT --> N3[Node 3<br/>stores + processes its blocks]
    SPLIT --> N4[Node 4<br/>stores + processes its blocks]
    N1 --> AGG[Combine partial results]
    N2 --> AGG
    N3 --> AGG
    N4 --> AGG
    AGG --> OUT[Final result]

Two ideas make this work:

Distributed storage. The dataset is broken into blocks spread across the nodes' local disks, and — crucially — each block is replicated onto more than one node. This deliberate redundancy (the same resilience idea as RAID and backups from the transaction lesson) means that when a node fails — and in a cluster of thousands, some node is always failing — no data is lost and the work simply continues on a copy.
Distributed (parallel) processing. Rather than pulling all the data to one program, the computation is sent to the data: each node processes the blocks it already holds, locally and simultaneously. Because the nodes work in parallel, doubling the nodes can roughly halve the time for work that divides cleanly, so the system scales horizontally by adding more cheap machines rather than needing one impossibly powerful one.
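The storage idea can be sketched in a few lines. This is a toy model, not how any particular file system works: the function names are invented, and the round-robin placement rule is an assumption chosen to keep the example short.

```python
def split_into_blocks(data, block_size):
    """Cut a sequence into consecutive blocks of at most block_size items."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(num_blocks, num_nodes, replicas):
    """Assign each block to `replicas` different nodes, round-robin.

    Returns {block_id: [node_id, ...]}. Real systems use smarter placement;
    round-robin is an assumption made for this sketch.
    """
    if replicas > num_nodes:
        raise ValueError("cannot place more replicas than there are nodes")
    return {
        block: [(block + r) % num_nodes for r in range(replicas)]
        for block in range(num_blocks)
    }

def readable_blocks(placement, failed_nodes):
    """Return the ids of blocks that still have at least one live copy."""
    return [
        block for block, nodes in placement.items()
        if any(node not in failed_nodes for node in nodes)
    ]

blocks = split_into_blocks(list(range(10)), block_size=3)
placement = place_blocks(len(blocks), num_nodes=4, replicas=2)

# Node 1 fails: every block is still readable from its other replica.
survivors = readable_blocks(placement, failed_nodes={1})
print(survivors)  # [0, 1, 2, 3]
```

With replicas=1 the same failure would lose block 1, which is why replication is treated as essential rather than optional.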

This "move the computation to the data, process every block in parallel, then combine the answers" pattern is the engine of big data. But it only works if the per-block computations are genuinely independent — which is exactly the guarantee the next section is about.
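The split, process and combine pattern can be imitated on one computer. In the sketch below, worker threads from Python's concurrent.futures module stand in for nodes; the function names and the choice of summing as the per-block work are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_blocks(data, block_size):
    """Cut a sequence into consecutive blocks of at most block_size items."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def process_block(block):
    """Per-block work: the result depends only on the block it is given."""
    return sum(block)

def run_on_cluster(data, block_size, workers):
    blocks = split_into_blocks(data, block_size)
    # Worker threads stand in for nodes; each handles its blocks independently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_results = list(pool.map(process_block, blocks))
    # Combine step: merge the partial answers into the final result.
    return sum(partial_results)

readings = list(range(1, 1001))
total = run_on_cluster(readings, block_size=100, workers=4)
print(total)  # 500500
```

Because process_block touches nothing except its own block, the answer is the same whatever the block size or number of workers.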

The Role of Functional Programming in Parallel Processing

This is the part of the topic the specification singles out, and the part candidates most often get wrong. The question is: why is a functional style so well suited to processing big data in parallel?

The answer turns on side effects and shared state. The danger in any parallel computation is that two tasks running at once both read and write the same piece of memory — the race condition and lost-update problems you met with concurrent database transactions. Those problems exist because tasks share mutable state. The locking that fixes them adds waiting, contention and the risk of deadlock, and it does not scale well to thousands of nodes.
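To make the lost-update problem concrete, the sketch below replays one unlucky interleaving by hand instead of using real threads, so the outcome is the same every time it runs. The shared dictionary and the helper names are invented for the example.

```python
# Shared mutable state: one counter that both tasks read and write.
shared = {"count": 0}

def read_counter():
    return shared["count"]

def write_counter(value):
    shared["count"] = value

# An interleaving in which both tasks read before either writes.
a_seen = read_counter()      # task A reads 0
b_seen = read_counter()      # task B reads 0
write_counter(a_seen + 1)    # task A writes 1
write_counter(b_seen + 1)    # task B also writes 1, overwriting A's update
lost_update_result = shared["count"]

# Without shared state: each task returns its own contribution and the
# contributions are combined afterwards, so nothing can be overwritten.
def count_events(events):
    return len(events)

combined_result = count_events(["a"]) + count_events(["b"])

print(lost_update_result)  # 1 (one increment was lost)
print(combined_result)     # 2
```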

Functional programming sidesteps the whole problem by removing the shared mutable state in the first place:

Functional property	What it means	Why it enables safe parallelism
Pure functions	A function's output depends only on its inputs, and it has no side effects (it changes nothing outside itself).	Calling the same function on the same data always gives the same answer, whenever and wherever it runs — so the work can be moved to any node and run in any order with no surprises.
No shared mutable state	Functions do not read or write global variables that other tasks also touch.	With nothing shared to corrupt, there are no race conditions and no need for locks — the central obstacle to parallelism is gone.
Immutable data	Data is never modified in place; transforming it produces a new value, leaving the original untouched.	Many tasks can read the same input simultaneously and safely, because none can change it underneath the others.
Higher-order functions (map, filter, reduce/fold)	Functions that take other functions as arguments and apply them across a whole collection.	They express a computation as "apply this independent function to every element" — a description that is trivially parallelisable: the elements can be shared out across nodes.
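The four properties in the table can be seen side by side in a short sketch. All names here are illustrative; Python is not a purely functional language, so the purity shown is a discipline the programmer follows, not something the language enforces.

```python
from functools import reduce

# Mutable state shared by every call to the impure function below.
state = {"total": 0}

def add_to_total_impure(x):
    """Impure: reads and writes shared state, so its result depends on history."""
    state["total"] += x
    return state["total"]

def square(x):
    """Pure: the output depends only on the argument; nothing else changes."""
    return x * x

def is_even(x):
    """Pure predicate used with filter."""
    return x % 2 == 0

readings = (1, 2, 3, 4, 5, 6)   # a tuple, so it cannot be modified in place

squares = tuple(map(square, readings))          # apply to every element
evens = tuple(filter(is_even, squares))         # keep those that pass a test
total = reduce(lambda a, b: a + b, evens, 0)    # fold into one value

first_call = add_to_total_impure(5)    # 5
second_call = add_to_total_impure(5)   # 10: same input, different answer

print(squares, evens, total)  # (1, 4, 9, 16, 25, 36) (4, 16, 36) 56
```

The impure function gives two different answers for the same argument, which is precisely what makes it unsafe to run on several nodes at once; square gives the same answer wherever and whenever it runs.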

The famous concrete realisation is the MapReduce model. A computation is written as two pure functions:

a map function, applied independently and in parallel to every block of the input, producing intermediate key–value results; and
a reduce function, which combines all the intermediate values for each key into the final result.

flowchart LR
    IN[Input split into blocks] --> M1[map block 1]
    IN --> M2[map block 2]
    IN --> M3[map block 3]
    M1 --> SH[Group by key]
    M2 --> SH
    M3 --> SH
    SH --> R1[reduce key A]
    SH --> R2[reduce key B]
    R1 --> OUT[Combined result]
    R2 --> OUT
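The three stages in the diagram can be written out as a single-machine word count, the customary first MapReduce example. This is a sketch of the idea only: the function names are invented, the blocks are tiny, and a real framework would run each map_block call on a different node.

```python
from collections import defaultdict
from functools import reduce

def map_block(block):
    """Map: turn one block of text lines into (word, 1) pairs."""
    return [(word.lower(), 1) for line in block for word in line.split()]

def group_by_key(pairs):
    """Group by key: collect every value emitted for the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

def reduce_key(values):
    """Reduce: combine all the values for one key into a single result."""
    return reduce(lambda a, b: a + b, values, 0)

def map_reduce(blocks):
    # Each map_block call is independent, so these could run in parallel.
    intermediate = [pair for block in blocks for pair in map_block(block)]
    grouped = group_by_key(intermediate)
    return {key: reduce_key(values) for key, values in grouped.items()}

blocks = [
    ["big data is big"],
    ["data is varied"],
    ["big clusters process data"],
]
counts = map_reduce(blocks)
print(counts["big"], counts["data"], counts["is"])  # 3 3 2
```

The programmer supplies only map_block and reduce_key; the splitting, grouping and combining are the parts a framework would take over.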

Because map is pure and side-effect-free, the framework is free to run it on every block at once, retry it on a different node if a machine dies (re-running a pure function is always safe because it cannot have half-changed anything), and combine the results in any convenient order, provided the combining operation gives the same answer however the values are grouped and ordered (as addition does). That freedom is the whole point. The functional guarantees are precisely what let the framework distribute, parallelise and recover automatically, hiding all of it from the programmer, who need only supply two ordinary functions. An imperative solution full of shared counters and in-place updates would offer none of these guarantees and could not be parallelised so freely or so safely.
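Two of these freedoms, retrying after a failure and combining in any order, can be checked in a small simulation. The crash is faked by simply discarding a fixed number of attempts, and the names count_long_words and run_with_retry are hypothetical; no real framework's API is being shown.

```python
import random
from functools import reduce

def count_long_words(block):
    """Pure map step: how many words in the block have more than 4 letters."""
    return sum(1 for word in block if len(word) > 4)

def run_with_retry(task, block, fail_first_attempts):
    """Simulate a node that crashes a fixed number of times before succeeding.

    Because `task` is pure, abandoning a failed attempt and starting again
    leaves nothing half-changed behind. Returns (result, attempts_used).
    """
    attempts = 0
    while True:
        attempts += 1
        if attempts <= fail_first_attempts:
            continue            # simulated crash: discard and try again
        return task(block), attempts

blocks = [
    ["large", "data", "sets"],
    ["stream", "of", "sensor", "readings"],
    ["map", "reduce", "cluster"],
]

results_clean = [run_with_retry(count_long_words, b, 0)[0] for b in blocks]
results_retry = [run_with_retry(count_long_words, b, 2)[0] for b in blocks]

# Addition is associative and commutative, so the partial results can be
# combined in any order and still give the same total.
shuffled = results_retry[:]
random.Random(0).shuffle(shuffled)
total_in_order = reduce(lambda a, b: a + b, results_clean, 0)
total_shuffled = reduce(lambda a, b: a + b, shuffled, 0)

print(results_clean == results_retry, total_in_order, total_shuffled)  # True 6 6
```

A combining operation that is not associative and commutative, such as subtraction, would not survive the shuffle, which is why reduce functions are chosen with care.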

Exam Tip: "Why is functional programming used to process big data?" is a stock question. The winning answer is: functions are pure / have no side effects and data is immutable, so there is no shared mutable state, therefore no race conditions and no locks, therefore the work can be split across many processors and run in any order safely (and re-run on failure). Tie it explicitly to map and reduce.

Challenges and Opportunities

Big data is neither a magic wand nor a menace — like most powerful technology it is both, and a good answer weighs the two.

Challenges

Challenge	Explanation
Storage and cost	Holding and replicating petabytes needs large clusters, electricity and skilled staff; the infrastructure is expensive to build and run.
Velocity / processing	Analysing fast-arriving streams in time to act on them is technically demanding and pushes hardware hard.
Veracity (data quality)	Conclusions are only as trustworthy as the data: noise, gaps, duplication and error can produce confident but wrong results ("garbage in, garbage out").
Privacy and consent	Combining many datasets can re-identify individuals who appeared anonymous in each one, and people rarely understand how far their data travels (a direct link to the Data Protection Act lesson).
Bias	Patterns learned from historical data can entrench historical unfairness, so an analysis can be statistically valid yet ethically harmful.
Security	A single vast store of sensitive data is a high-value target; a breach is correspondingly serious.
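The privacy row can be illustrated with a toy linkage between two datasets. Every record below is invented, and the choice of postcode, birth year and sex as the matching fields is an assumption made for the example; the sketch shows only the mechanism by which removing names does not guarantee anonymity.

```python
# Dataset 1: "anonymised" health records (names removed). Invented data.
health_records = [
    {"postcode": "AB1", "birth_year": 1990, "sex": "F", "condition": "asthma"},
    {"postcode": "AB1", "birth_year": 1985, "sex": "M", "condition": "diabetes"},
    {"postcode": "CD2", "birth_year": 1990, "sex": "F", "condition": "migraine"},
]

# Dataset 2: a public list with names but no health information. Invented data.
public_register = [
    {"name": "Person A", "postcode": "AB1", "birth_year": 1990, "sex": "F"},
    {"name": "Person B", "postcode": "CD2", "birth_year": 1972, "sex": "M"},
]

# Fields that are harmless alone but identifying in combination.
QUASI_IDENTIFIERS = ("postcode", "birth_year", "sex")

def key_of(record):
    """Build the combined key used to match records across datasets."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)

def link(records, register):
    """Return (name, condition) pairs where exactly one health record matches."""
    matches = []
    for person in register:
        candidates = [r for r in records if key_of(r) == key_of(person)]
        if len(candidates) == 1:
            matches.append((person["name"], candidates[0]["condition"]))
    return matches

reidentified = link(health_records, public_register)
print(reidentified)  # [('Person A', 'asthma')]
```

Neither dataset names a person's condition on its own; the link appears only when the two are combined.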
