Data Architecture, Pipelines, and ETL

10 min read Β· Updated 2026-04-25

For cloud SaaS applications, data architecture is the foundation that enables personalization, analytics, compliance, and real-time operations. As applications grow from thousands to millions of users, the path data takes β€” from generation to consumption β€” becomes increasingly complex.

This lesson follows the natural data lifecycle: ingestion β†’ storage β†’ security β†’ DR β†’ processing β†’ ML.

The Hard Problems

Modern SaaS faces data problems that classic enterprise systems mostly didn’t:

Scale + performance
Millions of concurrent users, sub-second response times, consistency across distributed systems.
Workload diversity
OLTP needs low latency. Analytics needs throughput. ML needs both historical depth and real-time features. Observability needs fast ingest + fast query.
Multi-tenancy
Isolate tenant data without managing thousands of separate systems.
Global distribution
Consistent performance regardless of user location. Data replication, edge optimization, latency-aware routing.
Compliance
GDPR, HIPAA, SOC 2 must be implemented at the architecture level β€” not bolted on.

Ingestion

How data enters the system. Three primary patterns:

Batch
Scheduled, high-volume
Nightly warehouse loads, monthly compliance reports, ML training. Works for use cases where latency doesn't matter and volume does.
Streaming
Real-time, event-driven
Fraud detection, real-time dashboards, personalization engines. Events processed as they arrive.
Micro-batch
Small batches every few minutes
Near-real-time analytics. Splits the difference between batch latency and per-event streaming cost.

Many production systems combine all three: stream critical user events, micro-batch analytics, batch reference data nightly.
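The micro-batch idea can be sketched in a few lines of Python. This is illustrative and not tied to any framework; the class and parameter names are made up, and real systems would also flush from a background timer rather than only on arrival.

```python
import time

class MicroBatcher:
    """Buffer incoming events and flush them in small batches.

    Flushes when either the size cap or the time window is hit,
    whichever comes first (time check happens on each add here;
    production systems use a background timer as well).
    """

    def __init__(self, sink, max_size=100, max_wait_s=60.0):
        self.sink = sink            # callable that receives a list of events
        self.max_size = max_size
        self.max_wait_s = max_wait_s
        self.buffer = []
        self.last_flush = time.monotonic()

    def add(self, event):
        self.buffer.append(event)
        if (len(self.buffer) >= self.max_size
                or time.monotonic() - self.last_flush >= self.max_wait_s):
            self.flush()

    def flush(self):
        if self.buffer:
            self.sink(self.buffer)
            self.buffer = []
        self.last_flush = time.monotonic()

batches = []
mb = MicroBatcher(batches.append, max_size=3, max_wait_s=300)
for i in range(7):
    mb.add({"event_id": i})
mb.flush()  # drain the tail
# batches now holds three micro-batches: sizes 3, 3, and 1
```

Tuning `max_size` and `max_wait_s` is exactly the latency-versus-throughput dial the three ingestion patterns represent.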

Streaming platforms

Apache Kafka
The dominant choice. Powers Netflix recommendations, Uber pricing. Append-only log; high throughput, low latency, replay capability.
Cloud-managed
Amazon Kinesis, Azure Event Hubs, Google Cloud Pub/Sub, Amazon MSK (managed Kafka). Same patterns, less operational overhead.
Apache Pulsar
Multi-tenant, geo-replication built in. Used at Yahoo, Tencent. Strong fit for multi-region SaaS.

These platforms handle millions of events per second while preserving order and durability. The append-only log model means once data is written, it stays for replay and recovery β€” making debugging dramatically easier.
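The append-only model is easy to see in a toy Python log (names and structure are illustrative, not any platform's API). The key property: consumers track their own offsets, so replay from any point falls out for free.

```python
class AppendOnlyLog:
    """Minimal sketch of the log abstraction Kafka-style platforms expose.

    Records are appended and never mutated; each consumer keeps its own
    offset, so any consumer can replay from any point in the log.
    """

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1   # offset of the new record

    def read(self, offset=0, limit=None):
        """Replay records starting at `offset`, in write order."""
        end = None if limit is None else offset + limit
        return self._records[offset:end]

log = AppendOnlyLog()
for evt in ["signup", "click", "purchase"]:
    log.append({"type": evt})

# A new consumer replays from the beginning; an existing one resumes.
assert [r["type"] for r in log.read(0)] == ["signup", "click", "purchase"]
assert [r["type"] for r in log.read(2)] == ["purchase"]
```

Because history is never overwritten, a bug in a downstream consumer is fixed by redeploying it and replaying from an earlier offset, which is the debugging win described above.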

Batch platforms

Apache Airflow
Standard orchestrator for complex data workflows. Visual DAGs make pipeline structure clear.
Cloud ETL
AWS Glue (serverless ETL), Azure Data Factory, Google Dataflow (unified batch + stream).
Modern Python
Prefect, Dagster β€” Python-first orchestrators with better testing and debugging UX.
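The DAG idea all these orchestrators share can be sketched with the standard library's graphlib. Task names here are invented for illustration; real orchestrators add scheduling, retries, and backfills on top of this ordering logic.

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Tasks mapped to their upstream dependencies, Airflow-style.
dag = {
    "extract_orders": set(),
    "extract_users": set(),
    "join": {"extract_orders", "extract_users"},
    "load_warehouse": {"join"},
}

def run(dag, tasks):
    """Execute each task once all of its dependencies have finished."""
    executed = []
    for name in TopologicalSorter(dag).static_order():
        tasks[name]()                 # dependencies are guaranteed done
        executed.append(name)
    return executed

order = run(dag, {name: (lambda: None) for name in dag})
assert order.index("join") > order.index("extract_orders")
assert order[-1] == "load_warehouse"
```

`static_order()` raises on cycles, which is the same guarantee orchestrators rely on to reject an invalid pipeline before running it.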

Storage

Storage system categories

Operational databases
Real-time low-latency app data. Postgres, MySQL, MongoDB, DynamoDB, time-series DBs (InfluxDB, Timescale), search (Elasticsearch).
Data lakes
Raw data in original format, no upfront schema. S3, Azure Data Lake Storage, GCS, HDFS. For exploration and diverse analytics.
Data warehouses
Structured data optimized for analytics. Columnar storage (Parquet, ORC). Snowflake, BigQuery, Redshift. Pre-aggregated views, smart partitioning.
Lakehouses
Metadata layer over data lakes adding ACID, time travel, schema evolution. Delta Lake, Apache Iceberg, Apache Hudi. Lake flexibility + warehouse reliability.
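The "smart partitioning" mentioned above usually means hive-style paths, where partition columns become path segments so query engines can prune whole prefixes without reading data. A minimal sketch (table name, column names, and layout are illustrative):

```python
from datetime import datetime, timezone

def partition_key(table, event_time, tenant_id):
    """Build a hive-style object-store key. Partition columns become
    path segments, so engines skip entire prefixes when filtering
    on date or tenant."""
    dt = event_time.strftime("%Y-%m-%d")
    return f"{table}/dt={dt}/tenant={tenant_id}/"

key = partition_key("events",
                    datetime(2026, 4, 25, tzinfo=timezone.utc),
                    "acme")
assert key == "events/dt=2026-04-25/tenant=acme/"
```

A query filtered to one tenant and one day then touches a single prefix instead of scanning the whole table, which is where most lake query-cost savings come from.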

The lakehouse value

The lakehouse pattern earns its keep through:

ACID transactions
Reliable concurrent reads and writes directly on cheap lake storage.
Time travel
Query past versions of a table; simplifies audits, rollbacks, and reproducible ML training.
Schema evolution
Add or change columns without rewriting the whole dataset.
One copy of the data
Exploration, BI, and ML all run against the same storage instead of maintaining lake-to-warehouse copies.

Security

Security isn’t an afterthought in modern data architecture β€” it’s a foundational requirement that shapes every design decision.

Data classification
Public, internal, confidential, restricted. Different controls per class. User emails β‰  anonymized usage metrics.
Encryption everywhere
At rest (AES-256), in transit (TLS), increasingly in use (homomorphic encryption for processing encrypted data). Cloud KMS makes this near-default.
Zero trust access
Every data request authenticated and authorized. Fine-grained permissions (row-level, column-level). ABAC over simple RBAC.
Audit trails
Every access logged. Critical for forensics and compliance. Immutable storage for audit logs themselves.
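Immutable audit logs are often implemented as hash chains. Here is a pure-Python sketch (field names are illustrative): each record's hash covers its predecessor's hash, so tampering with any entry breaks verification for everything after it.

```python
import hashlib
import json

def chain_entry(prev_hash, entry):
    """Create an audit record whose hash covers the previous record's
    hash, forming a tamper-evident chain."""
    payload = json.dumps(entry, sort_keys=True)
    digest = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
    return {"entry": entry, "prev": prev_hash, "hash": digest}

def verify(log, genesis="genesis"):
    """Walk the chain, recomputing every hash; any edit is detected."""
    prev = genesis
    for rec in log:
        payload = json.dumps(rec["entry"], sort_keys=True)
        expected = hashlib.sha256((prev + payload).encode()).hexdigest()
        if rec["prev"] != prev or rec["hash"] != expected:
            return False
        prev = rec["hash"]
    return True

log, prev = [], "genesis"
for entry in [{"user": "u1", "action": "read", "row": 42},
              {"user": "u2", "action": "export", "table": "orders"}]:
    rec = chain_entry(prev, entry)
    log.append(rec)
    prev = rec["hash"]

assert verify(log)
log[0]["entry"]["action"] = "delete"   # tamper with history
assert not verify(log)
```

In production the chain head is typically anchored in write-once storage (e.g. object lock), so even an attacker with database access cannot rewrite the trail undetected.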

Privacy-preserving techniques

Anonymization & pseudonymization
Strip or replace direct identifiers so usage data can feed analytics without exposing individuals.
Data masking
Realistic but fake values in non-production environments.
Differential privacy
Calibrated noise added to aggregates so individual records cannot be inferred from query results.

Disaster Recovery

DR planning is required, not optional. Two key metrics:

RTO β€” Recovery Time Objective
How fast must we recover?
How long can the system be down before unacceptable business impact? Influences architecture (active-passive vs. active-active).
RPO β€” Recovery Point Objective
How much data can we lose?
What's the maximum acceptable data loss window? Drives backup frequency and replication strategy.

DR strategies, from cheapest to most expensive:

Strategy           RTO                RPO                  Cost
Backup & restore   Hours to days      Hours                Low
Pilot light        Tens of minutes    Minutes              Medium
Warm standby       Minutes            Seconds to minutes   Medium-high
Active-active      Seconds            Near-zero            High

For multi-tenant SaaS, regulatory tier and tenant SLA often dictate which strategy applies β€” different tiers may get different DR commitments.
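Picking a strategy per tenant tier amounts to choosing the cheapest row of the table that still meets the tenant's RTO/RPO commitments. A sketch with illustrative bounds (the numbers here are made up for the example, not vendor guarantees):

```python
# (name, max RTO in seconds, max RPO in seconds, relative cost),
# roughly echoing the table above. Bounds are illustrative.
STRATEGIES = [
    ("backup-restore", 24 * 3600, 4 * 3600, 1),
    ("pilot-light",    30 * 60,   5 * 60,   2),
    ("warm-standby",   5 * 60,    60,       3),
    ("active-active",  10,        1,        4),
]

def cheapest_strategy(rto_s, rpo_s):
    """Lowest-cost strategy whose RTO and RPO meet the requirement;
    fall back to the strongest option if nothing qualifies."""
    viable = [s for s in STRATEGIES if s[1] <= rto_s and s[2] <= rpo_s]
    return min(viable, key=lambda s: s[3])[0] if viable else "active-active"

# A standard tier tolerating a day of downtime and hours of loss:
assert cheapest_strategy(rto_s=24 * 3600, rpo_s=6 * 3600) == "backup-restore"
```

The same lookup run per tenant tier is how a platform offers stronger (and pricier) DR commitments only to the tenants whose SLAs require them.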

Processing

Three architectural styles for transformation:

Lambda Architecture
Separate batch and speed layers. Batch for accuracy + reprocessing; speed for low latency. Two codebases β€” known maintenance cost.
Kappa Architecture
Single stream-processing layer. Treats batch as a special case of streaming. Simpler operations, requires reprocessing through stream.
Modern unified
Apache Beam, Dataflow, Flink. One codebase that runs in batch or streaming mode. Best of both worlds; tooling still maturing.
ELT vs. ETL
Modern warehouses prefer Extract β†’ Load β†’ Transform β€” load raw data, transform in-warehouse with SQL. Cheaper to iterate than traditional ETL.
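The ELT flow is easy to demonstrate with SQLite standing in for the warehouse (this assumes a SQLite build with the JSON1 functions, which ships with recent Python; the table and fields are made up): load raw JSON untouched, then transform with SQL in place.

```python
import json
import sqlite3

# Load first: raw events land as-is, no upfront schema.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE raw_events (payload TEXT)")
raw = [{"user": "u1", "amount": 30},
       {"user": "u1", "amount": 12},
       {"user": "u2", "amount": 5}]
con.executemany("INSERT INTO raw_events VALUES (?)",
                [(json.dumps(e),) for e in raw])

# Transform later, in-warehouse, with SQL.
rows = con.execute("""
    SELECT json_extract(payload, '$.user')        AS user,
           SUM(json_extract(payload, '$.amount')) AS total
    FROM raw_events
    GROUP BY user
    ORDER BY user
""").fetchall()

assert rows == [("u1", 42), ("u2", 5)]
```

Because the raw payloads are kept, changing the transformation is a matter of editing SQL and re-running it, with no need to re-extract from source systems; that is the iteration advantage over traditional ETL.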

ML and AI Considerations

Modern SaaS data architectures need to support ML workloads alongside traditional analytics.

Feature stores
Centralized repositories for ML features. Reuse features across models. Tecton, Feast, AWS SageMaker Feature Store.
Online + offline parity
Same feature definition for training (offline) and serving (online). Prevents training-serving skew.
Real-time features
Compute features on streaming data. Aggregations over recent windows. Used for fraud, personalization, recommendations.
Vector databases
Pinecone, Weaviate, pgvector, Qdrant. For embeddings β€” semantic search, RAG, recommendations. The new must-have for AI applications.
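Under the hood, a vector database answers nearest-neighbour queries over embeddings. A brute-force sketch makes the core operation concrete (real systems use approximate indexes such as HNSW for scale; the documents and vectors below are invented for illustration):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query, index, k=1):
    """Return the ids of the k stored embeddings closest to the query."""
    scored = sorted(index, key=lambda item: cosine(query, item[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

index = [
    ("doc-cats",  [0.9, 0.1, 0.0]),
    ("doc-dogs",  [0.8, 0.2, 0.1]),
    ("doc-taxes", [0.0, 0.1, 0.9]),
]
assert nearest([1.0, 0.0, 0.0], index, k=1) == ["doc-cats"]
```

Swap the three-dimensional toy vectors for model-produced embeddings and add an approximate index, and this is semantic search as used by RAG pipelines.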

Multi-Tenancy in Data Layer

For multi-tenant SaaS, data architecture decisions interact with tenancy. Three common isolation models:

Silo (database per tenant)
Strongest isolation, highest operational overhead. Fits regulated or premium tenants.
Bridge (schema per tenant)
Shared database, separate schemas per tenant. A middle ground.
Pool (shared tables with a tenant_id column)
Cheapest to operate; isolation is enforced in queries, row-level policies, and per-tenant keys.

The key tension: cost-efficiency wants shared infrastructure; isolation/compliance wants separation. Most platforms blend.
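In the pooled model, a common safeguard is to route every query through a tenant-scoped accessor so the tenant filter cannot be forgotten by application code. A sketch using SQLite with made-up table and class names:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE orders (tenant_id TEXT, order_id INTEGER, total REAL)")
con.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [("acme", 1, 10.0), ("acme", 2, 20.0), ("globex", 3, 99.0)])

class TenantScope:
    """All reads go through the scope, so the tenant_id predicate is
    applied centrally rather than repeated (and eventually forgotten)
    at every call site."""

    def __init__(self, con, tenant_id):
        self.con = con
        self.tenant_id = tenant_id

    def orders(self):
        return self.con.execute(
            "SELECT order_id, total FROM orders WHERE tenant_id = ?",
            (self.tenant_id,)).fetchall()

acme = TenantScope(con, "acme")
assert acme.orders() == [(1, 10.0), (2, 20.0)]
```

Databases with row-level security (e.g. Postgres policies) push the same guarantee down into the engine itself, which is stronger than enforcing it in application code alone.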

Recap