Big Data refers to datasets that are so large, complex, or fast-changing that traditional data processing methods cannot handle them effectively. The functional programming paradigm plays a key role in big data processing because its emphasis on immutability, statelessness, and parallelism makes it well-suited for distributed computing.
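To make the connection concrete, here is a minimal Python sketch (names and data are illustrative, not from the lesson) of why functional transformations parallelise safely: `map` applies a pure function to each element independently, and `reduce` folds the results together, so no step depends on shared mutable state.

```python
from functools import reduce

# Pure functions: output depends only on input, with no shared mutable
# state. Each element could therefore be processed on a different
# machine and the results combined afterwards.
readings = [3.0, 4.5, 2.25, 6.0]

squared = list(map(lambda x: x * x, readings))    # transform each element
total = reduce(lambda a, b: a + b, squared, 0.0)  # combine the results

print(squared)  # [9.0, 20.25, 5.0625, 36.0]
print(total)    # 70.3125
```

This is the same shape that distributed frameworks scale up: the `map` step runs on many machines at once, and the `reduce` step merges their partial results.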
Big data is commonly characterised by the "Three Vs":
**Volume**: The amount of data being generated and stored. Modern organisations may deal with terabytes, petabytes, or even exabytes of data.
Examples: social media platforms storing billions of posts, video streaming catalogues, logs from millions of IoT sensors.
**Velocity**: The speed at which data is generated, collected, and processed. Some applications require real-time or near-real-time processing.
Examples: stock market tick feeds, fraud detection on payment transactions, live clickstream data.
**Variety**: The different types and formats of data. Big data includes structured, semi-structured, and unstructured data.
| Data Type | Description | Examples |
|---|---|---|
| Structured | Organised in tables with defined schemas | Relational databases, spreadsheets |
| Semi-structured | Has some organisation but no rigid schema | JSON, XML, CSV files |
| Unstructured | No predefined format | Text documents, images, videos, audio, emails |
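The practical difference between structured and semi-structured data shows up when you parse it. A small sketch using Python's standard `csv` and `json` modules (the sample records are invented for illustration):

```python
import csv
import io
import json

# Semi-structured: JSON carries its own field names, and records may
# have nested or varying fields.
record = json.loads('{"user": "ada", "tags": ["ml", "fp"]}')

# Structured: CSV rows follow a fixed column order -- effectively a
# schema defined by position.
rows = list(csv.reader(io.StringIO("id,name\n1,ada\n2,grace\n")))

print(record["tags"])  # ['ml', 'fp']
print(rows[1])         # ['1', 'ada']
```

Unstructured data (images, free text, audio) has no such parser; extracting structure from it is itself a processing step.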
Some frameworks extend the model to five or more Vs, most commonly adding Veracity (the accuracy and trustworthiness of data) and Value (the useful insight that can be extracted from it). These characteristics explain why traditional approaches break down:
| Challenge | Traditional Approach | Big Data Reality |
|---|---|---|
| Storage | Single server / database | Data exceeds capacity of any single machine |
| Processing | Sequential processing | Too slow for the volume of data |
| Structure | Fixed schemas (SQL) | Data comes in many formats |
| Speed | Batch processing | Real-time processing needed |
Distributed computing solves the big data problem by spreading data and processing across many machines (a cluster) that work together.
| Concept | Description |
|---|---|
| Cluster | A group of connected computers (nodes) working together |
| Node | A single computer in the cluster |
| Data partitioning | Splitting data across multiple nodes |
| Parallel processing | Multiple nodes process different parts of the data simultaneously |
| Fault tolerance | The system continues to work even if some nodes fail |
| Data replication | Copies of data stored on multiple nodes for reliability |
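Data partitioning, the concept the table above builds on, can be sketched in a few lines. This is an in-process toy (the node count and records are made up), but it shows the key idea: records are assigned to nodes by hashing a key, so each node can process its shard independently and equal keys always land together.

```python
# A minimal sketch of hash partitioning across a cluster of "nodes".
NUM_NODES = 3

def partition(records, num_nodes):
    """Assign each (key, value) record to a shard by hashing its key."""
    shards = [[] for _ in range(num_nodes)]
    for key, value in records:
        shards[hash(key) % num_nodes].append((key, value))
    return shards

records = [("a", 1), ("b", 2), ("c", 3), ("a", 4)]
shards = partition(records, NUM_NODES)

# Every record lands in exactly one shard, and records with the same
# key always share a shard -- the property grouping operations rely on.
assert sum(len(s) for s in shards) == len(records)
```

Real systems add replication on top of this, storing each shard on several nodes so the cluster tolerates failures.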
MapReduce is a programming model for processing large datasets in parallel across a cluster. It was popularised by Google and is directly inspired by the map and reduce (fold) functions from functional programming.
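Before looking at the phases individually, here is the classic word-count example as a single-process Python sketch. The function names (`map_phase`, `shuffle`, `reduce_phase`) are illustrative; a real framework runs these same steps across many machines.

```python
from collections import defaultdict
from functools import reduce

def map_phase(document):
    # Emit (word, 1) pairs -- a pure function applied to each input split.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Group values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Fold each key's list of values into a single result.
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in groups.items()}

docs = ["big data big ideas", "data wins"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
print(counts)  # {'big': 2, 'data': 2, 'ideas': 1, 'wins': 1}
```

Because `map_phase` and `reduce_phase` are pure functions, the framework is free to run them on any node, in any order, and simply re-run them on failure.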
1. Map Phase: