Data Architecture, Pipelines, and ETL

10 min read Β· Updated 2026-04-25

For cloud SaaS applications, data architecture is the foundation that enables personalization, analytics, compliance, and real-time operations. As applications grow from thousands to millions of users, the path data takes β€” from generation to consumption β€” becomes increasingly complex.

This lesson follows the natural data lifecycle: ingestion β†’ storage β†’ security β†’ DR β†’ processing β†’ ML.

The Hard Problems

Modern SaaS faces data problems that classic enterprise systems mostly didn’t:

Scale + performance
Millions of concurrent users, sub-second response times, consistency across distributed systems.
Workload diversity
OLTP needs low latency. Analytics needs throughput. ML needs both historical depth and real-time features. Observability needs fast ingest + fast query.
Multi-tenancy
Isolate tenant data without managing thousands of separate systems.
Global distribution
Consistent performance regardless of user location. Data replication, edge optimization, latency-aware routing.
Compliance
GDPR, HIPAA, SOC 2 must be implemented at the architecture level β€” not bolted on.

Ingestion

How data enters the system. Three primary patterns:

Batch
Scheduled, high-volume
Nightly warehouse loads, monthly compliance reports, ML training. Works for use cases where latency doesn't matter and volume does.
Streaming
Real-time, event-driven
Fraud detection, real-time dashboards, personalization engines. Events processed as they arrive.
Micro-batch
Small batches every few minutes
Near-real-time analytics. Splits the difference between batch latency and per-event streaming cost.

Many production systems combine all three: stream critical user events, micro-batch analytics, batch reference data nightly.
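The micro-batch idea can be sketched in a few lines of Python. This is illustrative and not tied to any framework; the class and parameter names are made up, and real systems would also flush from a background timer rather than only on arrival.

```python
import time

class MicroBatcher:
    """Buffer incoming events and flush them in small batches.

    Flushes when either the size cap or the time window is hit,
    whichever comes first (time check happens on each add here;
    production systems use a background timer as well).
    """

    def __init__(self, sink, max_size=100, max_wait_s=60.0):
        self.sink = sink            # callable that receives a list of events
        self.max_size = max_size
        self.max_wait_s = max_wait_s
        self.buffer = []
        self.last_flush = time.monotonic()

    def add(self, event):
        self.buffer.append(event)
        if (len(self.buffer) >= self.max_size
                or time.monotonic() - self.last_flush >= self.max_wait_s):
            self.flush()

    def flush(self):
        if self.buffer:
            self.sink(self.buffer)
            self.buffer = []
        self.last_flush = time.monotonic()

batches = []
mb = MicroBatcher(batches.append, max_size=3, max_wait_s=300)
for i in range(7):
    mb.add({"event_id": i})
mb.flush()  # drain the tail
# batches now holds three micro-batches: sizes 3, 3, and 1
```

Tuning `max_size` and `max_wait_s` is exactly the latency-versus-throughput dial the three ingestion patterns represent.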

Streaming platforms

Apache Kafka
The dominant choice. Powers Netflix recommendations, Uber pricing. Append-only log; high throughput, low latency, replay capability.
Cloud-managed
Amazon Kinesis, Azure Event Hubs, Google Cloud Pub/Sub, Amazon MSK (managed Kafka). Same patterns, less operational overhead.
Apache Pulsar
Multi-tenant, geo-replication built in. Used at Yahoo, Tencent. Strong fit for multi-region SaaS.

These platforms handle millions of events per second while preserving order and durability. The append-only log model means once data is written, it stays for replay and recovery β€” making debugging dramatically easier.
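The append-only model is easy to see in a toy Python log (names and structure are illustrative, not any platform's API). The key property: consumers track their own offsets, so replay from any point falls out for free.

```python
class AppendOnlyLog:
    """Minimal sketch of the log abstraction Kafka-style platforms expose.

    Records are appended and never mutated; each consumer keeps its own
    offset, so any consumer can replay from any point in the log.
    """

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1   # offset of the new record

    def read(self, offset=0, limit=None):
        """Replay records starting at `offset`, in write order."""
        end = None if limit is None else offset + limit
        return self._records[offset:end]

log = AppendOnlyLog()
for evt in ["signup", "click", "purchase"]:
    log.append({"type": evt})

# A new consumer replays from the beginning; an existing one resumes.
assert [r["type"] for r in log.read(0)] == ["signup", "click", "purchase"]
assert [r["type"] for r in log.read(2)] == ["purchase"]
```

Because history is never overwritten, a bug in a downstream consumer is fixed by redeploying it and replaying from an earlier offset, which is the debugging win described above.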

Batch platforms

Apache Airflow
Standard orchestrator for complex data workflows. Visual DAGs make pipeline structure clear.
Cloud ETL
AWS Glue (serverless ETL), Azure Data Factory, Google Dataflow (unified batch + stream).
Modern Python
Prefect, Dagster β€” Python-first orchestrators with better testing and debugging UX.
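The DAG idea all these orchestrators share can be sketched with the standard library's graphlib. Task names here are invented for illustration; real orchestrators add scheduling, retries, and backfills on top of this ordering logic.

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Tasks mapped to their upstream dependencies, Airflow-style.
dag = {
    "extract_orders": set(),
    "extract_users": set(),
    "join": {"extract_orders", "extract_users"},
    "load_warehouse": {"join"},
}

def run(dag, tasks):
    """Execute each task once all of its dependencies have finished."""
    executed = []
    for name in TopologicalSorter(dag).static_order():
        tasks[name]()                 # dependencies are guaranteed done
        executed.append(name)
    return executed

order = run(dag, {name: (lambda: None) for name in dag})
assert order.index("join") > order.index("extract_orders")
assert order[-1] == "load_warehouse"
```

`static_order()` raises on cycles, which is the same guarantee orchestrators rely on to reject an invalid pipeline before running it.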

Storage

Storage system categories

Operational databases
Real-time low-latency app data. Postgres, MySQL, MongoDB, DynamoDB, time-series DBs (InfluxDB, Timescale), search (Elasticsearch).
Data lakes
Raw data in original format, no upfront schema. S3, Azure Data Lake Storage, GCS, HDFS. For exploration and diverse analytics.
Data warehouses
Structured data optimized for analytics. Columnar storage (Parquet, ORC). Snowflake, BigQuery, Redshift. Pre-aggregated views, smart partitioning.
Lakehouses
Metadata layer over data lakes adding ACID, time travel, schema evolution. Delta Lake, Apache Iceberg, Apache Hudi. Lake flexibility + warehouse reliability.
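The "smart partitioning" mentioned above usually means hive-style paths, where partition columns become path segments so query engines can prune whole prefixes without reading data. A minimal sketch (table name, column names, and layout are illustrative):

```python
from datetime import datetime, timezone

def partition_key(table, event_time, tenant_id):
    """Build a hive-style object-store key. Partition columns become
    path segments, so engines skip entire prefixes when filtering
    on date or tenant."""
    dt = event_time.strftime("%Y-%m-%d")
    return f"{table}/dt={dt}/tenant={tenant_id}/"

key = partition_key("events",
                    datetime(2026, 4, 25, tzinfo=timezone.utc),
                    "acme")
assert key == "events/dt=2026-04-25/tenant=acme/"
```

A query filtered to one tenant and one day then touches a single prefix instead of scanning the whole table, which is where most lake query-cost savings come from.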

The lakehouse value

The lakehouse pattern earns its keep through:

ACID transactions
Reliable concurrent reads and writes directly on cheap lake storage.
Time travel
Query past versions of a table; simplifies audits, rollbacks, and reproducible ML training.
Schema evolution
Add or change columns without rewriting the whole dataset.
One copy of the data
Exploration, BI, and ML all run against the same storage instead of maintaining lake-to-warehouse copies.

Security

Security isn’t an afterthought in modern data architecture β€” it’s a foundational requirement that shapes every design decision.

Data classification
Public, internal, confidential, restricted. Different controls per class. User emails β‰  anonymized usage metrics.
Encryption everywhere
At rest (AES-256), in transit (TLS), increasingly in use (homomorphic encryption for processing encrypted data). Cloud KMS makes this near-default.
Zero trust access
Every data request authenticated and authorized. Fine-grained permissions (row-level, column-level). ABAC over simple RBAC.
Audit trails
Every access logged. Critical for forensics and compliance. Immutable storage for audit logs themselves.
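Immutable audit logs are often implemented as hash chains. Here is a pure-Python sketch (field names are illustrative): each record's hash covers its predecessor's hash, so tampering with any entry breaks verification for everything after it.

```python
import hashlib
import json

def chain_entry(prev_hash, entry):
    """Create an audit record whose hash covers the previous record's
    hash, forming a tamper-evident chain."""
    payload = json.dumps(entry, sort_keys=True)
    digest = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
    return {"entry": entry, "prev": prev_hash, "hash": digest}

def verify(log, genesis="genesis"):
    """Walk the chain, recomputing every hash; any edit is detected."""
    prev = genesis
    for rec in log:
        payload = json.dumps(rec["entry"], sort_keys=True)
        expected = hashlib.sha256((prev + payload).encode()).hexdigest()
        if rec["prev"] != prev or rec["hash"] != expected:
            return False
        prev = rec["hash"]
    return True

log, prev = [], "genesis"
for entry in [{"user": "u1", "action": "read", "row": 42},
              {"user": "u2", "action": "export", "table": "orders"}]:
    rec = chain_entry(prev, entry)
    log.append(rec)
    prev = rec["hash"]

assert verify(log)
log[0]["entry"]["action"] = "delete"   # tamper with history
assert not verify(log)
```

In production the chain head is typically anchored in write-once storage (e.g. object lock), so even an attacker with database access cannot rewrite the trail undetected.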

Privacy-preserving techniques

Anonymization & pseudonymization
Strip or replace direct identifiers so usage data can feed analytics without exposing individuals.
Data masking
Realistic but fake values in non-production environments.
Differential privacy
Calibrated noise added to aggregates so individual records cannot be inferred from query results.

Disaster Recovery

DR planning is required, not optional. Two key metrics:

RTO β€” Recovery Time Objective
How fast must we recover?
How long can the system be down before unacceptable business impact? Influences architecture (active-passive vs. active-active).
RPO β€” Recovery Point Objective
How much data can we lose?
What's the maximum acceptable data loss window? Drives backup frequency and replication strategy.

DR strategies, from cheapest to most expensive:

Strategy           RTO                RPO                  Cost
Backup & restore   Hours to days      Hours                Low
Pilot light        Tens of minutes    Minutes              Medium
Warm standby       Minutes            Seconds to minutes   Medium-high
Active-active      Seconds            Near-zero            High

For multi-tenant SaaS, regulatory tier and tenant SLA often dictate which strategy applies β€” different tiers may get different DR commitments.
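Picking a strategy per tenant tier amounts to choosing the cheapest row of the table that still meets the tenant's RTO/RPO commitments. A sketch with illustrative bounds (the numbers here are made up for the example, not vendor guarantees):

```python
# (name, max RTO in seconds, max RPO in seconds, relative cost),
# roughly echoing the table above. Bounds are illustrative.
STRATEGIES = [
    ("backup-restore", 24 * 3600, 4 * 3600, 1),
    ("pilot-light",    30 * 60,   5 * 60,   2),
    ("warm-standby",   5 * 60,    60,       3),
    ("active-active",  10,        1,        4),
]

def cheapest_strategy(rto_s, rpo_s):
    """Lowest-cost strategy whose RTO and RPO meet the requirement;
    fall back to the strongest option if nothing qualifies."""
    viable = [s for s in STRATEGIES if s[1] <= rto_s and s[2] <= rpo_s]
    return min(viable, key=lambda s: s[3])[0] if viable else "active-active"

# A standard tier tolerating a day of downtime and hours of loss:
assert cheapest_strategy(rto_s=24 * 3600, rpo_s=6 * 3600) == "backup-restore"
```

The same lookup run per tenant tier is how a platform offers stronger (and pricier) DR commitments only to the tenants whose SLAs require them.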

Processing

Three architectural styles for transformation:

Lambda Architecture
Separate batch and speed layers. Batch for accuracy + reprocessing; speed for low latency. Two codebases β€” known maintenance cost.
Kappa Architecture
Single stream-processing layer. Treats batch as a special case of streaming. Simpler operations, requires reprocessing through stream.
Modern unified
Apache Beam, Dataflow, Flink. One codebase that runs in batch or streaming mode. Best of both worlds; tooling still maturing.
ELT vs. ETL
Modern warehouses prefer Extract β†’ Load β†’ Transform β€” load raw data, transform in-warehouse with SQL. Cheaper to iterate than traditional ETL.
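The ELT flow is easy to demonstrate with SQLite standing in for the warehouse (this assumes a SQLite build with the JSON1 functions, which ships with recent Python; the table and fields are made up): load raw JSON untouched, then transform with SQL in place.

```python
import json
import sqlite3

# Load first: raw events land as-is, no upfront schema.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE raw_events (payload TEXT)")
raw = [{"user": "u1", "amount": 30},
       {"user": "u1", "amount": 12},
       {"user": "u2", "amount": 5}]
con.executemany("INSERT INTO raw_events VALUES (?)",
                [(json.dumps(e),) for e in raw])

# Transform later, in-warehouse, with SQL.
rows = con.execute("""
    SELECT json_extract(payload, '$.user')        AS user,
           SUM(json_extract(payload, '$.amount')) AS total
    FROM raw_events
    GROUP BY user
    ORDER BY user
""").fetchall()

assert rows == [("u1", 42), ("u2", 5)]
```

Because the raw payloads are kept, changing the transformation is a matter of editing SQL and re-running it, with no need to re-extract from source systems; that is the iteration advantage over traditional ETL.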

ML and AI Considerations

Modern SaaS data architectures need to support ML workloads alongside traditional analytics.

Feature stores
Centralized repositories for ML features. Reuse features across models. Tecton, Feast, AWS SageMaker Feature Store.
Online + offline parity
Same feature definition for training (offline) and serving (online). Prevents training-serving skew.
Real-time features
Compute features on streaming data. Aggregations over recent windows. Used for fraud, personalization, recommendations.
Vector databases
Pinecone, Weaviate, pgvector, Qdrant. For embeddings β€” semantic search, RAG, recommendations. The new must-have for AI applications.
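Under the hood, a vector database answers nearest-neighbour queries over embeddings. A brute-force sketch makes the core operation concrete (real systems use approximate indexes such as HNSW for scale; the documents and vectors below are invented for illustration):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query, index, k=1):
    """Return the ids of the k stored embeddings closest to the query."""
    scored = sorted(index, key=lambda item: cosine(query, item[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

index = [
    ("doc-cats",  [0.9, 0.1, 0.0]),
    ("doc-dogs",  [0.8, 0.2, 0.1]),
    ("doc-taxes", [0.0, 0.1, 0.9]),
]
assert nearest([1.0, 0.0, 0.0], index, k=1) == ["doc-cats"]
```

Swap the three-dimensional toy vectors for model-produced embeddings and add an approximate index, and this is semantic search as used by RAG pipelines.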

Multi-Tenancy in Data Layer

For multi-tenant SaaS, data architecture decisions interact with tenancy. Three common isolation models:

Silo (database per tenant)
Strongest isolation, highest operational overhead. Fits regulated or premium tenants.
Bridge (schema per tenant)
Shared database, separate schemas per tenant. A middle ground.
Pool (shared tables with a tenant_id column)
Cheapest to operate; isolation is enforced in queries, row-level policies, and per-tenant keys.

The key tension: cost-efficiency wants shared infrastructure; isolation/compliance wants separation. Most platforms blend.
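In the pooled model, a common safeguard is to route every query through a tenant-scoped accessor so the tenant filter cannot be forgotten by application code. A sketch using SQLite with made-up table and class names:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE orders (tenant_id TEXT, order_id INTEGER, total REAL)")
con.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [("acme", 1, 10.0), ("acme", 2, 20.0), ("globex", 3, 99.0)])

class TenantScope:
    """All reads go through the scope, so the tenant_id predicate is
    applied centrally rather than repeated (and eventually forgotten)
    at every call site."""

    def __init__(self, con, tenant_id):
        self.con = con
        self.tenant_id = tenant_id

    def orders(self):
        return self.con.execute(
            "SELECT order_id, total FROM orders WHERE tenant_id = ?",
            (self.tenant_id,)).fetchall()

acme = TenantScope(con, "acme")
assert acme.orders() == [(1, 10.0), (2, 20.0)]
```

Databases with row-level security (e.g. Postgres policies) push the same guarantee down into the engine itself, which is stronger than enforcing it in application code alone.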

Recap