MaxCompute and Data Analytics

Alibaba Cloud offers a comprehensive suite of big data and analytics services. At the centre is MaxCompute (formerly ODPS), a fully managed data warehousing solution that can process massive datasets.

What is MaxCompute?

MaxCompute is a serverless, distributed data warehouse that handles petabyte-scale data processing. It is Alibaba Cloud's counterpart to Amazon Redshift, Google BigQuery, and Azure Synapse Analytics.

Key Characteristics

Serverless — no infrastructure to manage; pay for compute and storage
Massive scale — processes petabytes of data with thousands of nodes
SQL-based — uses a dialect of SQL for data processing
Integrated — works seamlessly with other Alibaba Cloud data services

What MaxCompute Handles

Capability	Description
SQL queries	Ad-hoc and scheduled analytical queries
ETL processing	Transform and clean large datasets
Machine learning	Built-in ML algorithms via PAI integration
Data storage	Store structured and semi-structured data
Graph computation	Process graph-based data models
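
As a rough illustration of the SQL and ETL rows in the table, the sketch below aggregates one day of events into a summary table. The names `raw_events` and `daily_event_counts` are hypothetical and used only for illustration, and a scheduler would normally supply the date rather than the hard-coded literal.

```sql
-- Hypothetical summary table, partitioned by day.
CREATE TABLE IF NOT EXISTS daily_event_counts (
  event_type  STRING,
  event_count BIGINT
)
PARTITIONED BY (dt STRING);

-- Rebuild one day's partition from a hypothetical source table
-- raw_events(user_id, event_type) partitioned by dt.
-- INSERT OVERWRITE replaces the target partition, so rerunning the job
-- for the same day does not duplicate rows.
INSERT OVERWRITE TABLE daily_event_counts PARTITION (dt = '2024-01-15')
SELECT
  event_type,
  COUNT(*) AS event_count
FROM raw_events
WHERE dt = '2024-01-15'
  AND event_type IS NOT NULL   -- simple cleaning step: drop rows with no event type
GROUP BY event_type;
```

Using `INSERT OVERWRITE` rather than `INSERT INTO` is a design choice here: it keeps a scheduled job safe to retry.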

MaxCompute Architecture

graph LR
  subgraph DS["Data Sources"]
    OSS["OSS"]
    RDS["RDS"]
    Log["Log Svc"]
    Kafka["Kafka"]
  end
  subgraph MC["MaxCompute"]
    Tables["Tables"]
    Eng["SQL Engine / MapReduce / Spark"]
  end
  subgraph CN["Consumers"]
    BI["BI Tools"]
    Rep["Reports"]
    DataV["DataV"]
    PAI["PAI (ML)"]
  end
  OSS --> MC
  RDS --> MC
  Log --> MC
  Kafka --> MC
  MC --> BI
  MC --> Rep
  MC --> DataV
  MC --> PAI

Projects and Tables

Projects

A project is the basic unit of organisation in MaxCompute:

Contains tables, resources, functions, and jobs
Has its own access control settings
Acts as a billing boundary
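
Because a project is also an access-control boundary, reading a table in another project means qualifying its name with that project. A minimal sketch, assuming two hypothetical projects named `analytics_dev` and `analytics_prod`, that neither uses the optional schema layer, and that the current user has been granted read access to the table:

```sql
-- Switch the session to the (hypothetical) development project.
USE analytics_dev;

-- Read a table owned by another (hypothetical) project by prefixing it
-- with the project name. The query fails with a permission error unless
-- access to analytics_prod.user_events has been granted.
SELECT COUNT(*) AS event_count
FROM analytics_prod.user_events
WHERE dt = '2024-01-15'
  AND region = 'cn';
```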

Tables

Tables in MaxCompute resemble relational database tables and come in two kinds:

Internal tables — data is stored and managed by MaxCompute
External tables — data remains in external storage (OSS, Tablestore) and is queried in place
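
To make the difference concrete, here is a hedged sketch of an external table over CSV files in OSS. The endpoint, bucket, path, account ID, and role name are all placeholders, and the exact properties required depend on how OSS access is authorised in your account, so treat this as a starting point to check against the current MaxCompute documentation rather than a definitive recipe.

```sql
-- Sketch only: every value in angle brackets is a placeholder.
CREATE EXTERNAL TABLE IF NOT EXISTS ext_user_events (
  user_id     STRING,
  event_type  STRING,
  event_data  STRING
)
STORED BY 'com.aliyun.odps.CsvStorageHandler'  -- built-in handler for CSV files
WITH SERDEPROPERTIES (
  -- RAM role that MaxCompute assumes to read from OSS (assumed role name).
  'odps.properties.rolearn' = 'acs:ram::<account-id>:role/aliyunodpsdefaultrole'
)
LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/<bucket>/user_events/';

-- The files stay in OSS; MaxCompute reads them when the query runs.
SELECT event_type, COUNT(*) AS event_count
FROM ext_user_events
GROUP BY event_type;
```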

Partitions

Partitions divide large tables into smaller segments for efficient querying:

-- Create a partitioned table
CREATE TABLE user_events (
  user_id     STRING,
  event_type  STRING,
  event_data  STRING
)
PARTITIONED BY (dt STRING, region STRING);

-- Query only one partition
SELECT * FROM user_events WHERE dt = '2024-01-15' AND region = 'cn';

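When a query filters on the partition columns, MaxCompute reads only the matching partitions instead of the whole table. This reduces the amount of data scanned, lowering both cost and query time.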
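
The sketch below continues with the `user_events` table to show how a partition is populated and how the filter decides what gets scanned. The sample row is invented for illustration.

```sql
-- Register one partition explicitly.
ALTER TABLE user_events
  ADD IF NOT EXISTS PARTITION (dt = '2024-01-15', region = 'cn');

-- Write a sample row into that partition. Partition values are given in the
-- PARTITION clause, not in the VALUES list.
INSERT INTO TABLE user_events PARTITION (dt = '2024-01-15', region = 'cn')
VALUES ('u001', 'login', '{"device":"ios"}');

-- List the partitions that currently exist.
SHOW PARTITIONS user_events;

-- Pruned: both partition columns are filtered, so one partition is read.
SELECT COUNT(*) AS logins
FROM user_events
WHERE dt = '2024-01-15' AND region = 'cn' AND event_type = 'login';

-- Not pruned: event_type is an ordinary column, so every partition is read.
SELECT COUNT(*) AS logins
FROM user_events
WHERE event_type = 'login';
```

Depending on project settings, MaxCompute may refuse the last query outright, because full scans of partitioned tables can be disabled by default.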

DataWorks

DataWorks is the integrated data development platform for MaxCompute. It provides:
