"Big data" used to mean Hadoop clusters and bespoke MapReduce code. Today it means cloud data warehouses, managed Spark, streaming engines, and a mature ecosystem that's accessible without running your own cluster.
This lesson surveys the patterns and tools for processing data at scale (terabytes to petabytes) and how they fit into modern SaaS data platforms.
The Two Modes
Batch processing
Process bounded data periodically
Run analyses on yesterday's data overnight. ETL pipelines, ML training, compliance reports. Simple programming model, high throughput, latency in hours.
Stream processing
Process unbounded data continuously
Process events as they arrive. Real-time dashboards, fraud detection, personalization. Complex programming model (windowing, late events); latency in seconds to minutes.
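To make the two modes concrete, here is the same per-user event count computed batch-style (whole bounded dataset at once) and stream-style (one event at a time, with running state) in plain Python. The event data is invented for illustration; real systems would use a framework rather than hand-rolled loops.

```python
from collections import Counter

events = [("alice", "click"), ("bob", "click"), ("alice", "view")]

# Batch: the full, bounded dataset is available; process it in one pass.
def batch_counts(bounded_events):
    return Counter(user for user, _ in bounded_events)

# Streaming: events arrive one by one; keep state and emit updated
# results continuously instead of waiting for the data to "end".
def stream_counts(event_iter):
    state = Counter()
    for user, _ in event_iter:
        state[user] += 1
        yield dict(state)  # an updated result per incoming event
```

Both produce the same final answer; the difference is when results become available and whether state must survive between events.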
ClickHouse
Open-source columnar OLAP. Extremely fast for analytical queries. Self-hosted or managed (ClickHouse Cloud, Aiven).
For most SaaS, the analytical workload lives in a cloud DW (Snowflake/BigQuery/Redshift) rather than a Spark cluster. Spark/Flink are reserved for things the warehouse can't do efficiently.
Kappa architecture
Single streaming pipeline. Treats batch as a bounded stream. Reprocessing means replaying the stream from the beginning. Simpler operationally; requires a processing engine that can handle replay.
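A minimal sketch of the Kappa replay idea: one code path derives state from a retained log, and "reprocessing" is just replaying that log from offset 0. The list-backed log and the key/value events are stand-ins; a real deployment would replay a Kafka topic.

```python
log = []  # stand-in for a durable, replayable log (e.g. a Kafka topic)

def append(event):
    log.append(event)

def build_state(from_offset=0):
    """Fold the log into derived state; replaying from 0 rebuilds it."""
    state = {}
    for key, value in log[from_offset:]:
        state[key] = value  # latest value per key wins
    return state

append(("user:1", "free"))
append(("user:1", "paid"))

v1 = build_state()  # initial derivation
v2 = build_state()  # after a code change: just replay from the start
```

Because the derivation is a pure fold over the log, fixing a bug in it means redeploying and replaying, with no separate batch pipeline to keep in sync.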
Modern Data Stack
The pattern most modern SaaS uses. The warehouse is the central hub; tools around it handle ingestion, transformation, activation, BI, ML. Less ops than Spark/Flink for analytical use cases.
ETL vs. ELT
ETL (classic)
Extract → Transform → Load
Transform data before loading into the warehouse. The historical default, from the era when warehouse storage was expensive and only curated data was worth loading.
ELT (modern)
Extract → Load → Transform
Load raw data first, transform in-warehouse with SQL. Cloud warehouses make this cheap. dbt is the canonical tool. Easier to iterate, better lineage.
For most modern SaaS data stacks, ELT with dbt is the right answer. Raw data in the warehouse is reusable; SQL-defined transformations are version-controlled and testable.
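The ELT flow can be sketched end to end with sqlite3 standing in for the warehouse: raw rows are loaded untouched, then a SQL transformation builds a derived table inside the "warehouse", which is the shape of a dbt model. Table and column names here are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a cloud warehouse

# Load: raw events land in the warehouse untransformed.
conn.execute("CREATE TABLE raw_events (user_id TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO raw_events VALUES (?, ?)",
    [("u1", 10.0), ("u1", 5.0), ("u2", 7.5)],
)

# Transform: a SQL model over the raw table, the dbt pattern.
conn.execute("""
    CREATE TABLE revenue_by_user AS
    SELECT user_id, SUM(amount) AS revenue
    FROM raw_events
    GROUP BY user_id
""")

rows = conn.execute(
    "SELECT user_id, revenue FROM revenue_by_user ORDER BY user_id"
).fetchall()
```

Because the raw table survives, the transformation can be rewritten and re-run at any time without re-extracting from source systems.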
Streaming Concepts You'll Need
Event time vs. processing time
Event time = when the event actually happened. Processing time = when the system processes it. You almost always want event time.
Windows
Group events into bounded sets β tumbling (fixed), sliding (overlapping), session (gap-based). The fundamental aggregation primitive in streaming.
Watermarks
Heuristic to determine "we've probably seen all events up to time T." Trade-off: late events vs. result latency.
State
Aggregations require state. Stream processors must checkpoint state for failure recovery and exactly-once semantics.
Backfilling
Reprocess historical data through a streaming pipeline. Requires storage that lets you rewind (Kafka, S3 with time-partitioning).
Compacted topics
Kafka feature that keeps the latest value per key. Useful for current-state caches built from streams.
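The event-time, window, and watermark concepts above fit together in a few lines of plain Python. This is a toy: the 10-second tumbling window, the allowed lateness, and the event data are all invented, and real engines (Flink, Spark Structured Streaming) additionally checkpoint this state for recovery.

```python
WINDOW = 10  # seconds per tumbling window (illustrative choice)

def tumbling_counts(events, allowed_lateness=5):
    """events: (event_time, key) pairs in processing order.

    Watermark heuristic: assume we've seen everything older than
    max_event_time_seen - allowed_lateness; drop later stragglers.
    """
    windows, max_seen, dropped = {}, 0, []
    for event_time, key in events:
        max_seen = max(max_seen, event_time)
        watermark = max_seen - allowed_lateness
        if event_time < watermark:
            dropped.append((event_time, key))  # too late, discard
            continue
        start = (event_time // WINDOW) * WINDOW  # tumbling bucket
        windows[(start, key)] = windows.get((start, key), 0) + 1
    return windows, dropped

# Events arrive out of order: 3 and 4 show up after 12 and 30.
events = [(1, "a"), (12, "a"), (3, "a"), (30, "a"), (4, "a")]
wins, dropped = tumbling_counts(events)
```

Note the trade-off the watermark encodes: a larger `allowed_lateness` accepts more stragglers but delays when a window's result can be considered final.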
Choosing for SaaS
Use case → Default choice
Analytical queries on TB-scale data → Cloud warehouse (Snowflake/BigQuery/Redshift)
Heavy ETL transforms → dbt, or Spark on a managed platform (Databricks/EMR)
Real-time analytics, fraud detection → Flink (managed via Confluent or Ververica)
Stream processing in AWS → Kinesis Data Analytics or Flink on EKS
Stream processing in GCP → Dataflow (Beam)
Greenfield data platform → Modern data stack (warehouse + dbt + Fivetran)
For most SaaS, the analytical workload should live in a cloud warehouse. Reach for streaming engines (Flink, Spark Streaming) when you have genuine real-time requirements.
Recap
Two modes: batch (bounded data, periodic) vs. streaming (unbounded, continuous). Modern frameworks unify both.
Major frameworks: Spark (workhorse), Flink (streaming specialist), Beam (portable API).
Cloud warehouses (Snowflake, BigQuery, Redshift) handle most analytical workloads.
Architectural patterns: Lambda (two pipelines), Kappa (one streaming pipeline), Modern Data Stack (warehouse-centric).
ELT > ETL in modern stacks. dbt is the SQL transformation standard.
Streaming concepts: event time, windows, watermarks, state, backfilling.
For SaaS: cloud warehouse first. Streaming engines for real-time needs.