Google Cloud Dataflow is a fully managed service for executing Apache Beam data processing pipelines. It handles both batch and streaming workloads, automatically scaling resources to match the volume of data being processed. Dataflow is deeply integrated with Pub/Sub, making it the primary choice for building real-time stream processing pipelines on Google Cloud.
Dataflow is Google Cloud's implementation of the Apache Beam programming model. Apache Beam provides a unified API for defining data processing pipelines that can run on multiple execution engines (runners); Dataflow is Google's fully managed, cloud-native runner.
| Characteristic | Description |
|---|---|
| Fully managed | No cluster provisioning, scaling, or maintenance |
| Unified | Same API for batch and streaming processing |
| Autoscaling | Workers scale up and down based on workload |
| Apache Beam | Open-source SDK in Java, Python, and Go |
| Exactly-once | Guarantees exactly-once processing in streaming mode |
| Integrated | Native connectors for Pub/Sub, BigQuery, GCS, Bigtable, and more |
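In practice, the runner is just a pipeline option: the same Beam code can run locally for development or on the managed Dataflow service without changes. A minimal sketch, assuming placeholder project, region, and bucket names:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Local execution for development and tests.
local_options = PipelineOptions(runner="DirectRunner")

# Managed execution on Dataflow; project, region, and bucket are placeholders.
cloud_options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="europe-west1",
    temp_location="gs://my-bucket/temp",
)

# The pipeline definition itself is identical for both runners.
with beam.Pipeline(options=local_options) as pipeline:
    (
        pipeline
        | "Create" >> beam.Create(["hello", "dataflow"])
        | "Uppercase" >> beam.Map(str.upper)
        | "Print" >> beam.Map(print)
    )
```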
A pipeline is the top-level container for a data processing job. It defines the sequence of transformations applied to input data.
A PCollection (parallel collection) is a distributed dataset. It can be bounded (batch) or unbounded (streaming):
| Type | Description | Example |
|---|---|---|
| Bounded | Finite dataset with known size | CSV files in GCS |
| Unbounded | Infinite stream of data | Messages from Pub/Sub |
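As a rough sketch, the kind of PCollection you get follows from the source you read; the bucket and subscription names below are the same placeholders used later in this lesson:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Bounded: a batch pipeline over files that already exist in Cloud Storage.
with beam.Pipeline() as batch_pipeline:
    orders = batch_pipeline | "Read files" >> beam.io.ReadFromText(
        "gs://my-bucket/orders/*.csv"
    )

# Unbounded: a streaming pipeline over a Pub/Sub subscription (note streaming=True).
streaming_options = PipelineOptions(streaming=True)
with beam.Pipeline(options=streaming_options) as streaming_pipeline:
    events = streaming_pipeline | "Read stream" >> beam.io.ReadFromPubSub(
        subscription="projects/my-project/subscriptions/orders-sub"
    )
```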
Transforms are the processing operations applied to PCollections:
| Transform | Description |
|---|---|
| ParDo | Apply a function to each element, emitting zero or more outputs (Map and FlatMap are built on it) |
| GroupByKey | Group elements by key |
| CoGroupByKey | Join multiple PCollections by key |
| Combine | Aggregate elements (sum, count, average) |
| Flatten | Merge multiple PCollections into one |
| Partition | Split a PCollection into multiple PCollections |
| Window | Group streaming elements into time-based windows |
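To make these concrete, here is a small local sketch (runnable with the DirectRunner) that applies ParDo, GroupByKey, and a per-key Combine to an in-memory PCollection of hypothetical comma-separated order lines:

```python
import apache_beam as beam

class ExtractCustomerTotal(beam.DoFn):
    """ParDo: emit a (customer, total) pair for each raw order line."""
    def process(self, element):
        customer, total = element.split(",")
        yield customer, float(total)

with beam.Pipeline() as pipeline:
    orders = pipeline | "Create" >> beam.Create([
        "alice,120.00",
        "bob,80.00",
        "alice,45.50",
    ])

    pairs = orders | "ParDo" >> beam.ParDo(ExtractCustomerTotal())

    # GroupByKey: gather every total for a given customer into one iterable.
    grouped = pairs | "GroupByKey" >> beam.GroupByKey()

    # Combine: aggregate per key instead of collecting raw values.
    totals = pairs | "Sum per customer" >> beam.CombinePerKey(sum)

    totals | "Print" >> beam.Map(print)
```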
Dataflow provides built-in connectors for reading from and writing to external systems:
| Source/Sink | Read | Write |
|---|---|---|
| Pub/Sub | ReadFromPubSub | WriteToPubSub |
| BigQuery | ReadFromBigQuery | WriteToBigQuery |
| Cloud Storage | ReadFromText, ReadFromAvro | WriteToText, WriteToAvro |
| Bigtable | ReadFromBigtable | WriteToBigtable |
| Kafka | ReadFromKafka | WriteToKafka |
| JDBC | ReadFromJdbc | WriteToJdbc |
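For example, a minimal batch pipeline can pair ReadFromText with WriteToText to transform files in Cloud Storage; the paths are placeholders:

```python
import apache_beam as beam

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/input/*.txt")
        | "Uppercase" >> beam.Map(str.upper)
        # WriteToText writes sharded output files using this prefix.
        | "Write" >> beam.io.WriteToText("gs://my-bucket/output/result")
    )
```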
Dataflow's streaming mode processes unbounded data from sources like Pub/Sub in real time. This is the most common pattern for Pub/Sub consumers that need complex transformations.
```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def parse_json(message: bytes) -> dict:
    # One possible implementation of the parse_json helper used below:
    # decode the Pub/Sub message payload from JSON bytes into a dict.
    return json.loads(message.decode("utf-8"))


options = PipelineOptions(
    streaming=True,
    project="my-project",
    runner="DataflowRunner",
    region="europe-west1",
    temp_location="gs://my-bucket/temp",
)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Read from Pub/Sub" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/orders-sub"
        )
        | "Parse JSON" >> beam.Map(parse_json)
        | "Filter High Value" >> beam.Filter(lambda order: order["total"] > 1000)
        | "Write to BigQuery" >> beam.io.WriteToBigQuery(
            "my-project:analytics.high_value_orders",
            schema="order_id:STRING,total:FLOAT,timestamp:TIMESTAMP",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```
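Running this script submits the job to Dataflow; because the pipeline is streaming, it keeps running (and autoscaling) until you drain or cancel it from the Cloud Console or with gcloud.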
Streaming pipelines process unbounded data in windows, which divide the stream into finite, time-based groups (for example, fixed intervals) so that aggregations can emit results continuously.
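A minimal sketch of fixed windowing, assuming the same placeholder subscription and a JSON payload with a `total` field: elements are grouped into 60-second windows and summed per window.

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import FixedWindows


def parse_json(message: bytes) -> dict:
    # Decode a Pub/Sub message payload from JSON bytes into a dict.
    return json.loads(message.decode("utf-8"))


options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Read" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/orders-sub"
        )
        | "Parse" >> beam.Map(parse_json)
        | "Extract totals" >> beam.Map(lambda order: order["total"])
        # Group the unbounded stream into fixed 60-second windows.
        | "Window" >> beam.WindowInto(FixedWindows(60))
        # Sum the totals within each window; without_defaults() is needed
        # for a global combine on windowed, unbounded data.
        | "Sum per window" >> beam.CombineGlobally(sum).without_defaults()
        | "Print" >> beam.Map(print)
    )
```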