Big Data Processing

8 min read · Updated 2026-04-25

"Big data" used to mean Hadoop clusters and bespoke MapReduce code. Today it means cloud data warehouses, managed Spark, streaming engines, and a mature ecosystem you can use without operating your own cluster.

This lesson surveys the patterns and tools for processing data at scale (terabytes to petabytes) and how they fit into modern SaaS data platforms.

The Two Modes

Batch processing
Process bounded data periodically
Run analyses on yesterday's data overnight. ETL pipelines, ML training, compliance reports. Simple programming model, high throughput, latency in hours.
Stream processing
Process unbounded data continuously
Process events as they arrive. Real-time dashboards, fraud detection, personalization. Complex programming model (windowing, late events), latency in seconds-minutes.
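
The two modes can be sketched in plain Python (illustrative only, no framework): batch consumes a bounded collection and returns one final answer; streaming consumes a possibly unbounded iterator and emits an updated answer per event.

```python
from typing import Iterable, Iterator

def batch_total(events: Iterable[int]) -> int:
    """Batch: the input is bounded, so we can return one final result."""
    return sum(events)

def streaming_totals(events: Iterator[int]) -> Iterator[int]:
    """Stream: the input may never end, so we emit a running result per event."""
    total = 0
    for e in events:
        total += e
        yield total  # downstream consumers see every intermediate answer

print(batch_total([3, 1, 4]))                   # one answer for the whole dataset
print(list(streaming_totals(iter([3, 1, 4]))))  # a running answer per event
```

The same aggregation, but the streaming version trades a simple return value for incremental output and the state-management obligations that come with it.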

Major Frameworks

Apache Spark — The Workhorse

In-memory computation
10-100x faster than Hadoop MapReduce. Resilient Distributed Datasets (RDDs) cache intermediate results.
Polyglot APIs
Scala (native), Python (PySpark), R, Java, SQL. PySpark is what most people use today.
Unified APIs
Same code for batch (DataFrame), streaming (Structured Streaming), ML (MLlib), graph (GraphX), SQL.
Managed offerings
Databricks (the company spun out of the Spark team), AWS EMR, Google Dataproc. Spark without managing your own cluster.

Apache Flink — The Streaming Engine

Streaming-first design
Built around event time, watermarks, and exactly-once semantics. Often considered the strongest dedicated streaming engine.
Low latency
Sub-second latency at scale. Used by Netflix, Uber, Alibaba for real-time personalization and fraud detection.
Stateful operators
First-class state management with checkpointing. Complex stateful streaming operations work cleanly.
Steeper learning curve
Concepts (event time, watermarks, windows) take time. Worth it when streaming is the core workload.
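
The stateful-operator idea above can be sketched in pure Python (a toy model, not the Flink API): the operator counts events per key, snapshots its state together with the input offset, and after a crash restores the snapshot and replays only the events after it.

```python
import copy

class CountingOperator:
    """Toy stateful operator: per-key counts, checkpointed with the source offset."""
    def __init__(self):
        self.counts = {}   # operator state
        self.offset = 0    # position in the input log

    def process(self, key):
        self.counts[key] = self.counts.get(key, 0) + 1
        self.offset += 1

    def checkpoint(self):
        # Snapshot state *and* offset together -> consistent recovery point.
        return copy.deepcopy({"counts": self.counts, "offset": self.offset})

    def restore(self, snapshot):
        self.counts = dict(snapshot["counts"])
        self.offset = snapshot["offset"]

log = ["a", "b", "a", "c", "a"]  # replayable source, e.g. a Kafka partition
op = CountingOperator()
for key in log[:3]:
    op.process(key)
snap = op.checkpoint()           # taken after 3 events

op.process("zzz")                # progress that is lost in the "crash"
op.restore(snap)                 # recover: state and offset roll back together
for key in log[op.offset:]:      # replay only events after the checkpoint
    op.process(key)

print(op.counts)                 # {'a': 3, 'b': 1, 'c': 1} -- no double counting
```

Because the offset is checkpointed atomically with the counts, replayed events are counted exactly once; this is the essence of what checkpoint-based exactly-once semantics buy you.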

Apache Beam — The Portable API

Write once, run anywhere
Beam pipelines run on multiple runners: Spark, Flink, and Google Cloud Dataflow (and, via the Flink runner, on AWS's managed Flink service). Avoid lock-in to a single runtime.
Unified batch + streaming
Same model for both. The cleanest realization of "batch as streaming with bounded source."
Best fit
Greenfield projects where future portability matters. Multi-cloud teams. Teams that want flexibility without rewriting pipelines.

Cloud Data Warehouses

Snowflake
Decoupled compute/storage. Per-query cluster sizing. Multi-cloud (AWS/Azure/GCP). Often the modern default for analytical workloads.
BigQuery
Google Cloud's serverless warehouse. Pay per query. Column-oriented, massively parallel.
Redshift
AWS warehouse. Newer generation (RA3) decouples compute/storage. Mature ecosystem; tight AWS integration.
ClickHouse
Open-source columnar OLAP. Extremely fast for analytical queries. Self-hosted or managed (ClickHouse Cloud, Aiven).

For most SaaS, the analytical workload lives in a cloud DW (Snowflake/BigQuery/Redshift) rather than a Spark cluster. Spark/Flink are reserved for things the warehouse can't do efficiently.

Architectural Patterns

Lambda Architecture (Classic)

              ┌─→ [ Batch layer (Spark) ] ─→ [ Batch view ]
[ Source ] ───┤                                     ↘
              └─→ [ Speed layer (streaming) ] → [ Speed view ]
                                                      ↓
                                              [ Serving layer ]

Two parallel pipelines: batch for accuracy and reprocessing, speed for low latency. Outputs merged at query time. Two codebases, double maintenance.

Kappa Architecture (Modern)

[ Source ] → [ Stream processing (Flink) ] → [ State / View ]

Single streaming pipeline. Treats batch as a bounded stream. Reprocessing means replaying the stream from the beginning. Simpler operationally; requires a processing engine and a source log that can handle replay.
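
The Kappa idea in miniature (pure Python, hypothetical event shapes): because the source log is replayable, "reprocessing" is just running the same streaming function over the log from offset 0. Keeping the latest value per key is also exactly what a Kafka compacted topic gives you.

```python
# Replayable log of (key, value) updates -- a stand-in for a Kafka topic.
log = [
    ("user:1", "free"),
    ("user:2", "pro"),
    ("user:1", "pro"),   # later update supersedes the first user:1 record
    ("user:3", "free"),
]

def build_view(events):
    """Streaming function: fold events into a current-state view (last write wins)."""
    state = {}
    for key, value in events:
        state[key] = value
    return state

view = build_view(log)          # normal processing
reprocessed = build_view(log)   # "backfill" = replay the same log from offset 0

print(view == reprocessed)      # deterministic: replay rebuilds the same view
print(view["user:1"])           # latest value per key, like a compacted topic
```

One codebase serves both the live path and the backfill path, which is the operational simplification Kappa promises over Lambda's dual pipelines.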

Modern Data Stack (Pragmatic)

[ Sources ]
    │
    ▼ (Fivetran, Airbyte, custom CDC)
[ Cloud Warehouse: Snowflake/BigQuery/Redshift ]
    │
    ├─► [ Transformation: dbt (SQL) ]
    │
    ├─► [ Reverse ETL (Hightouch, Census) ] → [ Operational systems ]
    │
    ├─► [ BI (Looker, Mode, Tableau) ]
    │
    └─► [ ML platforms (Snowpark, BigQuery ML, SageMaker) ]

The pattern most modern SaaS uses. The warehouse is the central hub; tools around it handle ingestion, transformation, activation, BI, ML. Less ops than Spark/Flink for analytical use cases.

ETL vs. ELT

ETL (classic)
Extract → Transform → Load
Transform data before loading into the warehouse. The historical default, from when warehouse storage and compute were expensive and loading raw data was wasteful.
ELT (modern)
Extract → Load → Transform
Load raw data first, transform in-warehouse with SQL. Cloud warehouses make this cheap. dbt is the canonical tool. Easier to iterate, better lineage.

For most modern SaaS data stacks, ELT with dbt is the right answer. Raw data in the warehouse is reusable; SQL-defined transformations are version-controlled and testable.
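
ELT can be demonstrated with sqlite3 standing in for the warehouse (an illustrative sketch, not a dbt setup): raw rows are loaded untransformed, and the "T" is a SQL view defined on top of them, the same shape a dbt model takes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the cloud warehouse

# E + L: load raw events exactly as extracted -- no cleanup yet.
conn.execute("CREATE TABLE raw_events (user_id TEXT, amount_cents INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO raw_events VALUES (?, ?, ?)",
    [("u1", 500, "paid"), ("u1", 300, "refunded"), ("u2", 1200, "paid")],
)

# T: the transformation lives in the warehouse as SQL (what a dbt model would be).
conn.execute("""
    CREATE VIEW revenue_by_user AS
    SELECT user_id, SUM(amount_cents) / 100.0 AS revenue_dollars
    FROM raw_events
    WHERE status = 'paid'
    GROUP BY user_id
""")

print(conn.execute("SELECT * FROM revenue_by_user ORDER BY user_id").fetchall())
# [('u1', 5.0), ('u2', 12.0)]
```

Because `raw_events` is kept untouched, the view can be redefined and rebuilt at any time without re-extracting from the source, which is the reusability argument for ELT.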

Streaming Concepts You'll Need

Event time vs. processing time
Event time = when the event actually happened. Processing time = when the system processes it. You almost always want event time.
Windows
Group events into bounded sets: tumbling (fixed), sliding (overlapping), session (gap-based). The fundamental aggregation primitive in streaming.
Watermarks
Heuristic to determine "we've probably seen all events up to time T." Trade-off: late events vs. result latency.
State
Aggregations require state. Stream processors must checkpoint state for failure recovery and exactly-once semantics.
Backfilling
Reprocess historical data through a streaming pipeline. Requires storage that lets you rewind (Kafka, S3 with time-partitioning).
Compacted topics
Kafka feature that keeps the latest value per key. Useful for current-state caches built from streams.
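
The concepts above fit together in one small sketch (pure Python, hypothetical event shapes): events carry their own event time, a tumbling window groups them, and a watermark trailing a few seconds behind the maximum timestamp seen decides when a window is safe to close and whether a late event gets dropped.

```python
WINDOW = 10       # tumbling window size, in seconds of event time
ALLOWED_LAG = 5   # watermark trails the max event time seen by this much

def tumbling_windows(events):
    """events: (event_time_seconds, value) pairs in *arrival* order."""
    open_windows = {}   # window start -> running sum (the operator's state)
    closed = {}
    max_seen = 0
    for event_time, value in events:
        max_seen = max(max_seen, event_time)
        watermark = max_seen - ALLOWED_LAG   # "probably seen everything up to here"
        start = (event_time // WINDOW) * WINDOW
        if start + WINDOW <= watermark:
            continue  # late event: its window already closed; drop (or side-output)
        open_windows[start] = open_windows.get(start, 0) + value
        # Close every window that now lies entirely behind the watermark.
        for s in sorted(open_windows):
            if s + WINDOW <= watermark:
                closed[s] = open_windows.pop(s)
    closed.update(open_windows)  # end of input: flush remaining open windows
    return closed

# Out-of-order arrival: the t=8 event arrives after t=12 but still lands in [0, 10);
# the t=3 event arrives after the watermark passed 10, so it is dropped as late.
events = [(1, 1), (12, 1), (8, 1), (21, 1), (3, 1)]
print(tumbling_windows(events))  # {0: 2, 10: 1, 20: 1}
```

A larger `ALLOWED_LAG` admits more late events but delays results; that knob is exactly the watermark trade-off described above.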

Choosing for SaaS

Use case / Default choice

Analytical queries on TB data
Cloud warehouse (Snowflake/BigQuery/Redshift)
Heavy ETL transforms
dbt, or Spark on a managed platform (Databricks/EMR)
Real-time analytics, fraud detection
Flink (managed via Confluent or Ververica)
Stream processing in AWS
Kinesis Data Analytics (now Amazon Managed Service for Apache Flink) or Flink on EKS
Stream processing in GCP
Dataflow (Beam)
Greenfield data platform
Modern data stack (warehouse + dbt + Fivetran)

For most SaaS, the analytical workload should live in a cloud warehouse. Reach for streaming engines (Flink, Spark Streaming) when you have genuine real-time requirements.

Recap