Big Data Processing

8 min read · Updated 2026-04-25

"Big data" used to mean Hadoop clusters and bespoke MapReduce code. Today it means cloud data warehouses, managed Spark, streaming engines, and a mature ecosystem you can use without operating your own cluster.

This lesson surveys the patterns and tools for processing data at scale (terabytes to petabytes) and how they fit into modern SaaS data platforms.

The Two Modes

Batch processing
Process bounded data periodically
Run analyses on yesterday's data overnight. ETL pipelines, ML training, compliance reports. Simple programming model, high throughput, latency in hours.
Stream processing
Process unbounded data continuously
Process events as they arrive. Real-time dashboards, fraud detection, personalization. Complex programming model (windowing, late events), latency in seconds-minutes.
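
The two modes can be sketched in plain Python (illustrative only, no framework): batch consumes a bounded collection and returns one final answer; streaming consumes a possibly unbounded iterator and emits an updated answer per event.

```python
from typing import Iterable, Iterator

def batch_total(events: Iterable[int]) -> int:
    """Batch: the input is bounded, so we can return one final result."""
    return sum(events)

def streaming_totals(events: Iterator[int]) -> Iterator[int]:
    """Stream: the input may never end, so we emit a running result per event."""
    total = 0
    for e in events:
        total += e
        yield total  # downstream consumers see every intermediate answer

print(batch_total([3, 1, 4]))                   # one answer for the whole dataset
print(list(streaming_totals(iter([3, 1, 4]))))  # a running answer per event
```

The same aggregation, but the streaming version trades a simple return value for incremental output and the state-management obligations that come with it.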

Major Frameworks

Apache Spark — The Workhorse

In-memory computation
10-100x faster than Hadoop MapReduce. Resilient Distributed Datasets (RDDs) cache intermediate results.
Polyglot APIs
Scala (native), Python (PySpark), R, Java, SQL. PySpark is what most people use today.
Unified APIs
Same code for batch (DataFrame), streaming (Structured Streaming), ML (MLlib), graph (GraphX), SQL.
Managed offerings
Databricks (the company spun out of the Spark team), AWS EMR, Google Dataproc. Spark without managing your own cluster.

Apache Flink — The Streaming Engine

Streaming-first design
Built around event time, watermarks, and exactly-once semantics. Often considered the strongest dedicated streaming engine.
Low latency
Sub-second latency at scale. Used by Netflix, Uber, Alibaba for real-time personalization and fraud detection.
Stateful operators
First-class state management with checkpointing. Complex stateful streaming operations work cleanly.
Steeper learning curve
Concepts (event time, watermarks, windows) take time. Worth it when streaming is the core workload.
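
The stateful-operator idea above can be sketched in pure Python (a toy model, not the Flink API): the operator counts events per key, snapshots its state together with the input offset, and after a crash restores the snapshot and replays only the events after it.

```python
import copy

class CountingOperator:
    """Toy stateful operator: per-key counts, checkpointed with the source offset."""
    def __init__(self):
        self.counts = {}   # operator state
        self.offset = 0    # position in the input log

    def process(self, key):
        self.counts[key] = self.counts.get(key, 0) + 1
        self.offset += 1

    def checkpoint(self):
        # Snapshot state *and* offset together -> consistent recovery point.
        return copy.deepcopy({"counts": self.counts, "offset": self.offset})

    def restore(self, snapshot):
        self.counts = dict(snapshot["counts"])
        self.offset = snapshot["offset"]

log = ["a", "b", "a", "c", "a"]  # replayable source, e.g. a Kafka partition
op = CountingOperator()
for key in log[:3]:
    op.process(key)
snap = op.checkpoint()           # taken after 3 events

op.process("zzz")                # progress that is lost in the "crash"
op.restore(snap)                 # recover: state and offset roll back together
for key in log[op.offset:]:      # replay only events after the checkpoint
    op.process(key)

print(op.counts)                 # {'a': 3, 'b': 1, 'c': 1} -- no double counting
```

Because the offset is checkpointed atomically with the counts, replayed events are counted exactly once; this is the essence of what checkpoint-based exactly-once semantics buy you.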

Apache Beam — The Portable API

Write once, run anywhere
Beam pipelines run on multiple runners: Spark, Flink, and Google Cloud Dataflow (and, via the Flink runner, on AWS's managed Flink service). Avoid lock-in to a single runtime.
Unified batch + streaming
Same model for both. The cleanest realization of "batch as streaming with bounded source."
Best fit
Greenfield projects where future portability matters. Multi-cloud teams. Teams that want flexibility without rewriting pipelines.

Cloud Data Warehouses

Snowflake
Decoupled compute/storage. Per-query cluster sizing. Multi-cloud (AWS/Azure/GCP). Often the modern default for analytical workloads.
BigQuery
Google Cloud's serverless warehouse. Pay per query. Column-oriented, massively parallel.
Redshift
AWS warehouse. Newer generation (RA3) decouples compute/storage. Mature ecosystem; tight AWS integration.
ClickHouse
Open-source columnar OLAP. Extremely fast for analytical queries. Self-hosted or managed (ClickHouse Cloud, Aiven).

For most SaaS, the analytical workload lives in a cloud DW (Snowflake/BigQuery/Redshift) rather than a Spark cluster. Spark/Flink are reserved for things the warehouse can't do efficiently.

Architectural Patterns

Lambda Architecture (Classic)

              ┌─→ [ Batch layer (Spark) ] ─→ [ Batch view ]
[ Source ] ───┤                                     ↘
              └─→ [ Speed layer (streaming) ] → [ Speed view ]
                                                      ↓
                                              [ Serving layer ]

Two parallel pipelines: batch for accuracy and reprocessing, speed for low latency. Outputs merged at query time. Two codebases, double maintenance.

Kappa Architecture (Modern)

[ Source ] → [ Stream processing (Flink) ] → [ State / View ]

Single streaming pipeline. Treats batch as a bounded stream. Reprocessing means replaying the stream from the beginning. Simpler operationally; requires a processing engine and a source log that can handle replay.
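
The Kappa idea in miniature (pure Python, hypothetical event shapes): because the source log is replayable, "reprocessing" is just running the same streaming function over the log from offset 0. Keeping the latest value per key is also exactly what a Kafka compacted topic gives you.

```python
# Replayable log of (key, value) updates -- a stand-in for a Kafka topic.
log = [
    ("user:1", "free"),
    ("user:2", "pro"),
    ("user:1", "pro"),   # later update supersedes the first user:1 record
    ("user:3", "free"),
]

def build_view(events):
    """Streaming function: fold events into a current-state view (last write wins)."""
    state = {}
    for key, value in events:
        state[key] = value
    return state

view = build_view(log)          # normal processing
reprocessed = build_view(log)   # "backfill" = replay the same log from offset 0

print(view == reprocessed)      # deterministic: replay rebuilds the same view
print(view["user:1"])           # latest value per key, like a compacted topic
```

One codebase serves both the live path and the backfill path, which is the operational simplification Kappa promises over Lambda's dual pipelines.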

Modern Data Stack (Pragmatic)

[ Sources ]
    │
    ▼ (Fivetran, Airbyte, custom CDC)
[ Cloud Warehouse: Snowflake/BigQuery/Redshift ]
    │
    ├─► [ Transformation: dbt (SQL) ]
    │
    ├─► [ Reverse ETL (Hightouch, Census) ] → [ Operational systems ]
    │
    ├─► [ BI (Looker, Mode, Tableau) ]
    │
    └─► [ ML platforms (Snowpark, BigQuery ML, SageMaker) ]

The pattern most modern SaaS uses. The warehouse is the central hub; tools around it handle ingestion, transformation, activation, BI, ML. Less ops than Spark/Flink for analytical use cases.

ETL vs. ELT

ETL (classic)
Extract → Transform → Load
Transform data before loading into the warehouse. The historical default, from when warehouse storage and compute were expensive and loading raw data was wasteful.
ELT (modern)
Extract → Load → Transform
Load raw data first, transform in-warehouse with SQL. Cloud warehouses make this cheap. dbt is the canonical tool. Easier to iterate, better lineage.

For most modern SaaS data stacks, ELT with dbt is the right answer. Raw data in the warehouse is reusable; SQL-defined transformations are version-controlled and testable.
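
ELT can be demonstrated with sqlite3 standing in for the warehouse (an illustrative sketch, not a dbt setup): raw rows are loaded untransformed, and the "T" is a SQL view defined on top of them, the same shape a dbt model takes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the cloud warehouse

# E + L: load raw events exactly as extracted -- no cleanup yet.
conn.execute("CREATE TABLE raw_events (user_id TEXT, amount_cents INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO raw_events VALUES (?, ?, ?)",
    [("u1", 500, "paid"), ("u1", 300, "refunded"), ("u2", 1200, "paid")],
)

# T: the transformation lives in the warehouse as SQL (what a dbt model would be).
conn.execute("""
    CREATE VIEW revenue_by_user AS
    SELECT user_id, SUM(amount_cents) / 100.0 AS revenue_dollars
    FROM raw_events
    WHERE status = 'paid'
    GROUP BY user_id
""")

print(conn.execute("SELECT * FROM revenue_by_user ORDER BY user_id").fetchall())
# [('u1', 5.0), ('u2', 12.0)]
```

Because `raw_events` is kept untouched, the view can be redefined and rebuilt at any time without re-extracting from the source, which is the reusability argument for ELT.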

Streaming Concepts You'll Need

Event time vs. processing time
Event time = when the event actually happened. Processing time = when the system processes it. You almost always want event time.
Windows
Group events into bounded sets: tumbling (fixed), sliding (overlapping), session (gap-based). The fundamental aggregation primitive in streaming.
Watermarks
Heuristic to determine "we've probably seen all events up to time T." Trade-off: late events vs. result latency.
State
Aggregations require state. Stream processors must checkpoint state for failure recovery and exactly-once semantics.
Backfilling
Reprocess historical data through a streaming pipeline. Requires storage that lets you rewind (Kafka, S3 with time-partitioning).
Compacted topics
Kafka feature that keeps the latest value per key. Useful for current-state caches built from streams.
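
The concepts above fit together in one small sketch (pure Python, hypothetical event shapes): events carry their own event time, a tumbling window groups them, and a watermark trailing a few seconds behind the maximum timestamp seen decides when a window is safe to close and whether a late event gets dropped.

```python
WINDOW = 10       # tumbling window size, in seconds of event time
ALLOWED_LAG = 5   # watermark trails the max event time seen by this much

def tumbling_windows(events):
    """events: (event_time_seconds, value) pairs in *arrival* order."""
    open_windows = {}   # window start -> running sum (the operator's state)
    closed = {}
    max_seen = 0
    for event_time, value in events:
        max_seen = max(max_seen, event_time)
        watermark = max_seen - ALLOWED_LAG   # "probably seen everything up to here"
        start = (event_time // WINDOW) * WINDOW
        if start + WINDOW <= watermark:
            continue  # late event: its window already closed; drop (or side-output)
        open_windows[start] = open_windows.get(start, 0) + value
        # Close every window that now lies entirely behind the watermark.
        for s in sorted(open_windows):
            if s + WINDOW <= watermark:
                closed[s] = open_windows.pop(s)
    closed.update(open_windows)  # end of input: flush remaining open windows
    return closed

# Out-of-order arrival: the t=8 event arrives after t=12 but still lands in [0, 10);
# the t=3 event arrives after the watermark passed 10, so it is dropped as late.
events = [(1, 1), (12, 1), (8, 1), (21, 1), (3, 1)]
print(tumbling_windows(events))  # {0: 2, 10: 1, 20: 1}
```

A larger `ALLOWED_LAG` admits more late events but delays results; that knob is exactly the watermark trade-off described above.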

Choosing for SaaS

Use case / Default choice

Analytical queries on TB data
Cloud warehouse (Snowflake/BigQuery/Redshift)
Heavy ETL transforms
dbt, or Spark on a managed platform (Databricks/EMR)
Real-time analytics, fraud detection
Flink (managed via Confluent or Ververica)
Stream processing in AWS
Kinesis Data Analytics (now Amazon Managed Service for Apache Flink) or Flink on EKS
Stream processing in GCP
Dataflow (Beam)
Greenfield data platform
Modern data stack (warehouse + dbt + Fivetran)

For most SaaS, the analytical workload should live in a cloud warehouse. Reach for streaming engines (Flink, Spark Streaming) when you have genuine real-time requirements.

Recap