Big Data refers to datasets that are so large, complex, or fast-changing that traditional data processing methods cannot handle them effectively. The functional programming paradigm plays a key role in big data processing because its emphasis on immutability, statelessness, and parallelism makes it well-suited for distributed computing.
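To make the connection concrete, here is a minimal Python sketch (names and data are illustrative, not from the lesson) of why functional transformations parallelise safely: `map` applies a pure function to each element independently, and `reduce` folds the results together, so no step depends on shared mutable state.

```python
from functools import reduce

# Pure functions: output depends only on input, with no shared mutable
# state. Each element could therefore be processed on a different
# machine and the results combined afterwards.
readings = [3.0, 4.5, 2.25, 6.0]

squared = list(map(lambda x: x * x, readings))    # transform each element
total = reduce(lambda a, b: a + b, squared, 0.0)  # combine the results

print(squared)  # [9.0, 20.25, 5.0625, 36.0]
print(total)    # 70.3125
```

This is the same shape that distributed frameworks scale up: the `map` step runs on many machines at once, and the `reduce` step merges their partial results.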
Big data is commonly characterised by the "Three Vs":
**Volume**: The amount of data being generated and stored. Modern organisations may deal with terabytes, petabytes, or even exabytes of data.
Examples: social media platforms storing billions of posts, video streaming catalogues, logs from millions of IoT sensors.
**Velocity**: The speed at which data is generated, collected, and processed. Some applications require real-time or near-real-time processing.
Examples: stock market tick feeds, fraud detection on payment transactions, live clickstream data.
**Variety**: The different types and formats of data. Big data includes structured, semi-structured, and unstructured data.
| Data Type | Description | Examples |
|---|---|---|
| Structured | Organised in tables with defined schemas | Relational databases, spreadsheets |
| Semi-structured | Has some organisation but no rigid schema | JSON, XML, CSV files |
| Unstructured | No predefined format | Text documents, images, videos, audio, emails |
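The practical difference between structured and semi-structured data shows up when you parse it. A small sketch using Python's standard `csv` and `json` modules (the sample records are invented for illustration):

```python
import csv
import io
import json

# Semi-structured: JSON carries its own field names, and records may
# have nested or varying fields.
record = json.loads('{"user": "ada", "tags": ["ml", "fp"]}')

# Structured: CSV rows follow a fixed column order -- effectively a
# schema defined by position.
rows = list(csv.reader(io.StringIO("id,name\n1,ada\n2,grace\n")))

print(record["tags"])  # ['ml', 'fp']
print(rows[1])         # ['1', 'ada']
```

Unstructured data (images, free text, audio) has no such parser; extracting structure from it is itself a processing step.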
Some frameworks extend the model to five or more Vs, most commonly adding Veracity (the accuracy and trustworthiness of data) and Value (the useful insight that can be extracted from it). These characteristics explain why traditional approaches break down:
| Challenge | Traditional Approach | Big Data Reality |
|---|---|---|
| Storage | Single server / database | Data exceeds capacity of any single machine |
| Processing | Sequential processing | Too slow for the volume of data |
| Structure | Fixed schemas (SQL) | Data comes in many formats |
| Speed | Batch processing | Real-time processing needed |
Distributed computing solves the big data problem by spreading data and processing across many machines (a cluster) that work together.
| Concept | Description |
|---|---|
| Cluster | A group of connected computers (nodes) working together |
| Node | A single computer in the cluster |
| Data partitioning | Splitting data across multiple nodes |
| Parallel processing | Multiple nodes process different parts of the data simultaneously |
| Fault tolerance | The system continues to work even if some nodes fail |
| Data replication | Copies of data stored on multiple nodes for reliability |
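Data partitioning, the concept the table above builds on, can be sketched in a few lines. This is an in-process toy (the node count and records are made up), but it shows the key idea: records are assigned to nodes by hashing a key, so each node can process its shard independently and equal keys always land together.

```python
# A minimal sketch of hash partitioning across a cluster of "nodes".
NUM_NODES = 3

def partition(records, num_nodes):
    """Assign each (key, value) record to a shard by hashing its key."""
    shards = [[] for _ in range(num_nodes)]
    for key, value in records:
        shards[hash(key) % num_nodes].append((key, value))
    return shards

records = [("a", 1), ("b", 2), ("c", 3), ("a", 4)]
shards = partition(records, NUM_NODES)

# Every record lands in exactly one shard, and records with the same
# key always share a shard -- the property grouping operations rely on.
assert sum(len(s) for s in shards) == len(records)
```

Real systems add replication on top of this, storing each shard on several nodes so the cluster tolerates failures.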
MapReduce is a programming model for processing large datasets in parallel across a cluster. It was popularised by Google and is directly inspired by the map and reduce (fold) functions from functional programming.
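Before looking at the phases individually, here is the classic word-count example as a single-process Python sketch. The function names (`map_phase`, `shuffle`, `reduce_phase`) are illustrative; a real framework runs these same steps across many machines.

```python
from collections import defaultdict
from functools import reduce

def map_phase(document):
    # Emit (word, 1) pairs -- a pure function applied to each input split.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Group values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Fold each key's list of values into a single result.
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in groups.items()}

docs = ["big data big ideas", "data wins"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
print(counts)  # {'big': 2, 'data': 2, 'ideas': 1, 'wins': 1}
```

Because `map_phase` and `reduce_phase` are pure functions, the framework is free to run them on any node, in any order, and simply re-run them on failure.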
1. Map Phase: