Google Cloud Dataflow is a fully managed service for executing Apache Beam data processing pipelines. It handles both batch and streaming workloads, automatically scaling resources to match the volume of data being processed. Dataflow is deeply integrated with Pub/Sub, making it the primary choice for building real-time stream processing pipelines on Google Cloud.
Dataflow is Google Cloud's implementation of the Apache Beam programming model. Apache Beam provides a unified API for defining data processing pipelines that can run on multiple execution engines (runners); Dataflow is Google's fully managed, cloud-native runner.
| Characteristic | Description |
|---|---|
| Fully managed | No cluster provisioning, scaling, or maintenance |
| Unified | Same API for batch and streaming processing |
| Autoscaling | Workers scale up and down based on workload |
| Apache Beam | Open-source SDK in Java, Python, and Go |
| Exactly-once | Guarantees exactly-once processing in streaming mode |
| Integrated | Native connectors for Pub/Sub, BigQuery, GCS, Bigtable, and more |
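In practice, the runner is just a pipeline option: the same Beam code can run locally for development or on the managed Dataflow service without changes. A minimal sketch, assuming placeholder project, region, and bucket names:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Local execution for development and tests.
local_options = PipelineOptions(runner="DirectRunner")

# Managed execution on Dataflow; project, region, and bucket are placeholders.
cloud_options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="europe-west1",
    temp_location="gs://my-bucket/temp",
)

# The pipeline definition itself is identical for both runners.
with beam.Pipeline(options=local_options) as pipeline:
    (
        pipeline
        | "Create" >> beam.Create(["hello", "dataflow"])
        | "Uppercase" >> beam.Map(str.upper)
        | "Print" >> beam.Map(print)
    )
```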
A pipeline is the top-level container for a data processing job. It defines the sequence of transformations applied to input data.
A PCollection (parallel collection) is a distributed dataset. It can be bounded (batch) or unbounded (streaming):
| Type | Description | Example |
|---|---|---|
| Bounded | Finite dataset with known size | CSV files in GCS |
| Unbounded | Infinite stream of data | Messages from Pub/Sub |
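As a rough sketch, the kind of PCollection you get follows from the source you read; the bucket and subscription names below are the same placeholders used later in this lesson:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Bounded: a batch pipeline over files that already exist in Cloud Storage.
with beam.Pipeline() as batch_pipeline:
    orders = batch_pipeline | "Read files" >> beam.io.ReadFromText(
        "gs://my-bucket/orders/*.csv"
    )

# Unbounded: a streaming pipeline over a Pub/Sub subscription (note streaming=True).
streaming_options = PipelineOptions(streaming=True)
with beam.Pipeline(options=streaming_options) as streaming_pipeline:
    events = streaming_pipeline | "Read stream" >> beam.io.ReadFromPubSub(
        subscription="projects/my-project/subscriptions/orders-sub"
    )
```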
Transforms are the processing operations applied to PCollections:
| Transform | Description |
|---|---|
| ParDo | Apply a function to each element, emitting zero or more outputs (Map and FlatMap are built on it) |
| GroupByKey | Group elements by key |
| CoGroupByKey | Join multiple PCollections by key |
| Combine | Aggregate elements (sum, count, average) |
| Flatten | Merge multiple PCollections into one |
| Partition | Split a PCollection into multiple PCollections |
| Window | Group streaming elements into time-based windows |
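To make these concrete, here is a small local sketch (runnable with the DirectRunner) that applies ParDo, GroupByKey, and a per-key Combine to an in-memory PCollection of hypothetical comma-separated order lines:

```python
import apache_beam as beam

class ExtractCustomerTotal(beam.DoFn):
    """ParDo: emit a (customer, total) pair for each raw order line."""
    def process(self, element):
        customer, total = element.split(",")
        yield customer, float(total)

with beam.Pipeline() as pipeline:
    orders = pipeline | "Create" >> beam.Create([
        "alice,120.00",
        "bob,80.00",
        "alice,45.50",
    ])

    pairs = orders | "ParDo" >> beam.ParDo(ExtractCustomerTotal())

    # GroupByKey: gather every total for a given customer into one iterable.
    grouped = pairs | "GroupByKey" >> beam.GroupByKey()

    # Combine: aggregate per key instead of collecting raw values.
    totals = pairs | "Sum per customer" >> beam.CombinePerKey(sum)

    totals | "Print" >> beam.Map(print)
```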
Dataflow provides built-in connectors for reading from and writing to external systems:
| Source/Sink | Read | Write |
|---|---|---|
| Pub/Sub | ReadFromPubSub | WriteToPubSub |
| BigQuery | ReadFromBigQuery | WriteToBigQuery |
| Cloud Storage | ReadFromText, ReadFromAvro | WriteToText, WriteToAvro |
| Bigtable | ReadFromBigtable | WriteToBigtable |
| Kafka | ReadFromKafka | WriteToKafka |
| JDBC | ReadFromJdbc | WriteToJdbc |
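For example, a minimal batch pipeline can pair ReadFromText with WriteToText to transform files in Cloud Storage; the paths are placeholders:

```python
import apache_beam as beam

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/input/*.txt")
        | "Uppercase" >> beam.Map(str.upper)
        # WriteToText writes sharded output files using this prefix.
        | "Write" >> beam.io.WriteToText("gs://my-bucket/output/result")
    )
```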
Dataflow's streaming mode processes unbounded data from sources like Pub/Sub in real time. This is the most common pattern for Pub/Sub consumers that need complex transformations.
```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def parse_json(message: bytes) -> dict:
    # One possible implementation of the parse_json helper used below:
    # decode the Pub/Sub message payload from JSON bytes into a dict.
    return json.loads(message.decode("utf-8"))


options = PipelineOptions(
    streaming=True,
    project="my-project",
    runner="DataflowRunner",
    region="europe-west1",
    temp_location="gs://my-bucket/temp",
)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Read from Pub/Sub" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/orders-sub"
        )
        | "Parse JSON" >> beam.Map(parse_json)
        | "Filter High Value" >> beam.Filter(lambda order: order["total"] > 1000)
        | "Write to BigQuery" >> beam.io.WriteToBigQuery(
            "my-project:analytics.high_value_orders",
            schema="order_id:STRING,total:FLOAT,timestamp:TIMESTAMP",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```
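Running this script submits the job to Dataflow; because the pipeline is streaming, it keeps running (and autoscaling) until you drain or cancel it from the Cloud Console or with gcloud.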
Streaming pipelines process unbounded data in windows, which divide the stream into finite, time-based groups (for example, fixed intervals) so that aggregations can emit results continuously.
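A minimal sketch of fixed windowing, assuming the same placeholder subscription and a JSON payload with a `total` field: elements are grouped into 60-second windows and summed per window.

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import FixedWindows


def parse_json(message: bytes) -> dict:
    # Decode a Pub/Sub message payload from JSON bytes into a dict.
    return json.loads(message.decode("utf-8"))


options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Read" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/orders-sub"
        )
        | "Parse" >> beam.Map(parse_json)
        | "Extract totals" >> beam.Map(lambda order: order["total"])
        # Group the unbounded stream into fixed 60-second windows.
        | "Window" >> beam.WindowInto(FixedWindows(60))
        # Sum the totals within each window; without_defaults() is needed
        # for a global combine on windowed, unbounded data.
        | "Sum per window" >> beam.CombineGlobally(sum).without_defaults()
        | "Print" >> beam.Map(print)
    )
```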