For cloud SaaS applications, data architecture is the foundation that enables personalization, analytics, compliance, and real-time operations. As applications grow from thousands to millions of users, the path data takes – from generation to consumption – becomes increasingly complex.
This lesson follows the natural data lifecycle: ingestion → storage → security → DR → processing → ML.
The Hard Problems
Modern SaaS faces data problems that classic enterprise systems mostly didn't:
Scale + performance
Millions of concurrent users, sub-second response times, consistency across distributed systems.
Workload diversity
OLTP needs low latency. Analytics needs throughput. ML needs both historical depth and real-time features. Observability needs fast ingest + fast query.
Multi-tenancy
Isolate tenant data without managing thousands of separate systems.
Global distribution
Consistent performance regardless of user location. Data replication, edge optimization, latency-aware routing.
Compliance
GDPR, HIPAA, SOC 2 must be implemented at the architecture level, not bolted on.
Ingestion
How data enters the system. Three primary patterns:
Batch
Scheduled, high-volume
Nightly warehouse loads, monthly compliance reports, ML training. Works for use cases where latency doesn't matter and volume does.
Streaming
Real-time, event-driven
Fraud detection, real-time dashboards, personalization engines. Events processed as they arrive.
Micro-batch splits the difference: small batches every few minutes for near-real-time analytics. Many production systems combine all three: stream critical user events, batch reference data nightly.
Streaming platforms
Apache Kafka
The dominant choice. Powers Netflix recommendations, Uber pricing. Append-only log; high throughput, low latency, replay capability.
Cloud-managed
Amazon Kinesis, Azure Event Hubs, Google Pub/Sub, AWS MSK. Same patterns, less operational overhead.
Apache Pulsar
Multi-tenant, geo-replication built in. Used at Yahoo, Tencent. Strong fit for multi-region SaaS.
These platforms handle millions of events per second while preserving order and durability. The append-only log model means once data is written, it stays for replay and recovery, making debugging dramatically easier.
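A minimal in-memory sketch of the append-only log idea (illustrative only, not Kafka's actual API): every record gets an immutable offset, and consumers can replay from any offset, which is what makes recovery and debugging straightforward.

```python
class AppendOnlyLog:
    """Toy append-only log: records are never modified once written."""

    def __init__(self):
        self._records = []

    def append(self, record) -> int:
        # Returns the offset (position) assigned to the new record.
        self._records.append(record)
        return len(self._records) - 1

    def replay(self, from_offset: int = 0):
        # Consumers re-read from any offset; nothing is consumed destructively.
        return self._records[from_offset:]

log = AppendOnlyLog()
log.append({"event": "signup", "user": 1})
log.append({"event": "purchase", "user": 1})
events = log.replay(from_offset=0)
```

Real brokers add partitioning, replication, and retention policies on top of this core model.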
Batch platforms
Apache Airflow
Standard orchestrator for complex data workflows. Visual DAGs make pipeline structure clear.
Cloud ETL
AWS Glue (serverless ETL), Azure Data Factory, Google Dataflow (unified batch + stream).
Modern Python
Prefect, Dagster – Python-first orchestrators with better testing and debugging UX.
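To make the DAG idea concrete, here is a pure-Python sketch of dependency-ordered execution – the core of what orchestrators like Airflow provide. Task names are hypothetical, and no Airflow API is involved; the standard-library `graphlib` does the topological ordering.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dag = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}

def run(dag):
    executed = []
    for task in TopologicalSorter(dag).static_order():
        executed.append(task)  # a real orchestrator would invoke the task here
    return executed

order = run(dag)
```

Real orchestrators add what this sketch omits: scheduling, retries, backfills, and per-task logs.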
Storage
Where ingested data lands. Beyond the operational databases serving live traffic, three analytical storage categories dominate:
Data lakes
Raw data in original format, no upfront schema. S3, Azure Data Lake Storage, GCS, HDFS. For exploration and diverse analytics.
Data warehouses
Structured data optimized for analytics. Columnar storage (Parquet, ORC). Snowflake, BigQuery, Redshift. Pre-aggregated views, smart partitioning.
Lakehouses
Metadata layer over data lakes adding ACID, time travel, schema evolution. Delta Lake, Apache Iceberg, Apache Hudi. Lake flexibility + warehouse reliability.
The lakehouse value
The lakehouse pattern earns its keep through:
Schema evolution – handle changing data structures without breaking existing queries.
Time travel – query historical versions for audit and debugging.
ACID transactions – consistency during complex multi-step operations.
Unified analytics – a single system for batch and stream workloads.
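A toy sketch of the time-travel idea behind lakehouse table formats. This is heavily simplified – real systems like Delta Lake version files and metadata rather than copying whole tables – but the reader-facing contract is the same: every write produces a new version, and old versions stay queryable.

```python
import copy

class VersionedTable:
    """Each write creates a new immutable snapshot; old ones stay readable."""

    def __init__(self):
        self._versions = [[]]  # version 0 is the empty table

    def write(self, rows):
        new = copy.deepcopy(self._versions[-1]) + list(rows)
        self._versions.append(new)
        return len(self._versions) - 1  # the new version number

    def read(self, version=None):
        # version=None reads the latest snapshot; an older number "time travels".
        return self._versions[-1 if version is None else version]

t = VersionedTable()
v1 = t.write([{"id": 1}])
t.write([{"id": 2}])
latest = t.read()        # both rows
as_of_v1 = t.read(v1)    # the table as it was after the first write
```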
Security
Security isn't an afterthought in modern data architecture – it's a foundational requirement that shapes every design decision.
Data classification
Public, internal, confidential, restricted. Different controls per class. User emails vs. anonymized usage metrics.
Encryption everywhere
At rest (AES-256), in transit (TLS), increasingly in use (homomorphic encryption for processing encrypted data). Cloud KMS makes this near-default.
Zero trust access
Every data request authenticated and authorized. Fine-grained permissions (row-level, column-level). ABAC over simple RBAC.
Audit trails
Every access logged. Critical for forensics and compliance. Immutable storage for audit logs themselves.
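One common way to make audit logs tamper-evident is hash chaining: each entry embeds the hash of the previous one, so modifying any past entry breaks every hash after it. A minimal sketch (field names are illustrative):

```python
import hashlib
import json

def append_entry(chain, event):
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    body = {"event": event, "prev": prev_hash}
    digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    chain.append({**body, "hash": digest})

def verify(chain):
    prev = "0" * 64
    for entry in chain:
        body = {"event": entry["event"], "prev": entry["prev"]}
        expected = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
        if entry["prev"] != prev or entry["hash"] != expected:
            return False
        prev = entry["hash"]
    return True

chain = []
append_entry(chain, {"actor": "alice", "action": "read", "resource": "doc/42"})
append_entry(chain, {"actor": "bob", "action": "delete", "resource": "doc/42"})
ok = verify(chain)
chain[0]["event"]["actor"] = "mallory"   # tamper with history
tampered = verify(chain)
```

In production the same property usually comes from write-once object storage (e.g. S3 Object Lock) rather than hand-rolled chains.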
Privacy-preserving techniques
Anonymization – strip PII before analytics use.
Pseudonymization – replace identifiers with tokens; reversible only with a secure mapping.
Differential privacy – add controlled noise to aggregate stats; protect individuals while preserving statistical accuracy.
Synthetic data – generate fake data with the statistical properties of real data, for testing/dev.
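Pseudonymization can be sketched with a keyed hash: the token is deterministic (the same user always maps to the same token, so analytics joins still work), but reversing it requires the secret key or a secure mapping table. Illustrative only – a real system keeps the key in a KMS, never in code.

```python
import hashlib
import hmac

SECRET_KEY = b"example-key-kept-in-a-kms"  # hypothetical; never hardcode in production

def pseudonymize(identifier: str) -> str:
    # HMAC-SHA256 keyed hash, truncated to a compact token.
    return hmac.new(SECRET_KEY, identifier.encode(), hashlib.sha256).hexdigest()[:16]

t1 = pseudonymize("alice@example.com")
t2 = pseudonymize("alice@example.com")  # same input -> same token (stable join key)
t3 = pseudonymize("bob@example.com")    # different input -> different token
```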
Disaster Recovery
DR planning is required, not optional. Two key metrics:
RTO – Recovery Time Objective
How fast must we recover?
How long can the system be down before unacceptable business impact? Influences architecture (active-passive vs. active-active).
RPO – Recovery Point Objective
How much data can we lose?
What's the maximum acceptable data loss window? Drives backup frequency and replication strategy.
DR strategies, from cheapest to most expensive:
Strategy         | RTO             | RPO             | Cost
Backup & restore | Hours-days      | Hours           | Low
Pilot light      | Tens of minutes | Minutes         | Medium
Warm standby     | Minutes         | Seconds-minutes | Medium-high
Active-active    | Seconds         | Near-zero       | High
For multi-tenant SaaS, regulatory tier and tenant SLA often dictate which strategy applies β different tiers may get different DR commitments.
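The strategy table can be read as a selection function: given a tenant's RTO and RPO requirements, pick the cheapest strategy that satisfies both. A rough sketch – the numeric thresholds here are illustrative, not prescriptive:

```python
# (strategy, max_rto_seconds, max_rpo_seconds), ordered cheapest first.
# Thresholds are illustrative stand-ins for the ranges in the table above.
STRATEGIES = [
    ("backup-and-restore", 86_400, 6 * 3_600),
    ("pilot-light", 1_800, 300),
    ("warm-standby", 300, 60),
    ("active-active", 10, 1),
]

def choose_strategy(required_rto_s: float, required_rpo_s: float) -> str:
    for name, rto, rpo in STRATEGIES:
        if rto <= required_rto_s and rpo <= required_rpo_s:
            return name  # cheapest strategy that meets both objectives
    raise ValueError("no strategy meets these objectives")

tier = choose_strategy(required_rto_s=600, required_rpo_s=600)
```

Per-tier SLAs then become a lookup: enterprise tenants call this with tight objectives, the long tail with loose ones.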
Processing
Three architectural styles for transformation:
Lambda Architecture
Separate batch and speed layers. Batch for accuracy + reprocessing; speed for low latency. Two codebases – a known maintenance cost.
Kappa Architecture
Single stream-processing layer. Treats batch as a special case of streaming. Simpler operations, requires reprocessing through stream.
Modern unified
Apache Beam, Dataflow, Flink. One codebase that runs in batch or streaming mode. Best of both worlds; tooling still maturing.
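The unified model's core claim: one transformation, written once, consumes either a bounded batch or an unbounded stream. A minimal Python sketch of that idea (not the Beam API – just a generator-based transform that doesn't care whether its input is finite):

```python
def transform(events):
    # Defined once; works on any iterable, whether a finite
    # list (batch) or a generator (stream).
    for e in events:
        yield {"user": e["user"], "amount_cents": round(e["amount"] * 100)}

batch = [{"user": 1, "amount": 9.99}, {"user": 2, "amount": 5.00}]

def stream():
    # Simulated unbounded source; in production this would be a Kafka consumer.
    yield {"user": 3, "amount": 1.50}

batch_out = list(transform(batch))
stream_out = list(transform(stream()))
```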
ELT vs. ETL
Modern warehouses prefer Extract → Load → Transform: load raw data as-is, then transform in-warehouse with SQL. Cheaper to iterate than traditional ETL.
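ELT in miniature, using SQLite as a stand-in warehouse (table and column names are hypothetical): raw rows are loaded untransformed, and the transformation is a SQL statement inside the engine, so iterating on it means rewriting a query rather than redeploying a pipeline.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_events (user_id INTEGER, amount REAL)")

# Load: raw data goes in exactly as extracted.
conn.executemany(
    "INSERT INTO raw_events VALUES (?, ?)",
    [(1, 9.99), (1, 5.00), (2, 3.50)],
)

# Transform: derived tables are built in-warehouse with SQL.
conn.execute(
    """CREATE TABLE user_totals AS
       SELECT user_id, SUM(amount) AS total
       FROM raw_events GROUP BY user_id"""
)
totals = dict(conn.execute("SELECT user_id, total FROM user_totals"))
```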
ML and AI Considerations
Modern SaaS data architectures need to support ML workloads alongside traditional analytics.
Feature stores
Centralized repositories for ML features. Reuse features across models. Tecton, Feast, AWS SageMaker Feature Store.
Online + offline parity
Same feature definition for training (offline) and serving (online). Prevents training-serving skew.
Real-time features
Compute features on streaming data. Aggregations over recent windows. Used for fraud, personalization, recommendations.
Vector databases
Pinecone, Weaviate, pgvector, Qdrant. For embeddings – semantic search, RAG, recommendations. The new must-have for AI applications.
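Under the hood, a vector database answers nearest-neighbor queries over embeddings. A brute-force cosine-similarity sketch of that query (the 3-d vectors are made up; production embeddings have hundreds of dimensions, and real systems use approximate indexes like HNSW instead of a full scan):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical document embeddings.
docs = {
    "refund policy": [0.9, 0.1, 0.0],
    "api rate limits": [0.0, 0.2, 0.9],
    "billing faq": [0.8, 0.3, 0.1],
}

def search(query_vec, k=2):
    # Rank every document by similarity to the query; return the top k.
    ranked = sorted(docs, key=lambda d: cosine(query_vec, docs[d]), reverse=True)
    return ranked[:k]

top = search([1.0, 0.2, 0.0], k=2)
```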
Multi-Tenancy in Data Layer
For multi-tenant SaaS, data architecture decisions interact with tenancy:
Per-tenant data warehouses for big enterprise tenants – easy GDPR-style data deletion.
Shared warehouse with tenant_id partitioning for the long tail – efficient at scale.
Hybrid – usually correct in practice.
The key tension: cost-efficiency wants shared infrastructure; isolation/compliance wants separation. Most platforms blend.
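The hybrid pattern boils down to a routing decision. A sketch with illustrative policy thresholds and names: large or regulated tenants get dedicated storage, everyone else shares a warehouse partitioned by tenant_id.

```python
def route_tenant(tenant_id: str, monthly_rows: int, regulated: bool) -> str:
    # Illustrative policy: isolation for big or regulated tenants,
    # shared infrastructure (tenant_id partitioning) for the long tail.
    if regulated or monthly_rows > 100_000_000:
        return f"dedicated-warehouse-{tenant_id}"
    return "shared-warehouse"

a = route_tenant("acme", monthly_rows=500_000_000, regulated=False)
b = route_tenant("smallco", monthly_rows=20_000, regulated=False)
c = route_tenant("medbank", monthly_rows=1_000, regulated=True)
```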
Recap
The data lifecycle: ingest → store → secure → DR-protect → process → ML.
Ingestion patterns: batch, streaming, micro-batch. Most production systems mix all three.
Storage categories: operational DBs, data lakes, warehouses, lakehouses. None of these alone is enough; you'll likely use all four.
Security is foundational: classification, encryption, zero trust, audit. Privacy techniques (anonymization, differential privacy) protect individuals.
DR is non-negotiable; tier strategies by criticality (cost-RTO trade-off).
Processing: Lambda, Kappa, or modern unified. ELT > ETL in modern warehouse stacks.
ML adds new requirements: feature stores, vector DBs, online-offline parity.
Multi-tenancy interacts with every layer; per-tenant separation for isolation, shared infrastructure for efficiency.