Scalability and Data Modeling in Databases
Database scalability isn’t just about handling more data — it’s about maintaining performance, reliability, and reasonable cost as the system grows. Real database choices involve fundamental trade-offs between consistency, performance, and complexity.
This lesson provides a framework for evaluating database choices: workload patterns first, then data model, then scaling approach.
The Forces in Tension
Every database design balances consistency, performance, and operational complexity. Optimizing hard for one usually costs you in the others, and each database family below makes that trade-off differently.
Relational Databases — Still the Default
Predictability is the strength. The schema is upfront and explicit. SQL is one of the most powerful query languages ever designed.
Modern Postgres and MySQL have absorbed most of the features that once drove teams to NoSQL — JSON documents, full-text search, geospatial, time-series. For most SaaS apps, start with Postgres.
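A minimal sketch of the document-in-relational pattern. It uses SQLite's JSON functions (available in most modern Python builds) as a stand-in for Postgres, where the column would be `jsonb` and the extraction operator `->>`:

```python
import sqlite3

# Store JSON documents in an ordinary relational table, then query
# inside them. In Postgres: `data jsonb` and `data->>'plan'`.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, data TEXT)")
conn.execute(
    "INSERT INTO events (data) VALUES (?)",
    ('{"type": "signup", "plan": "pro"}',),
)
row = conn.execute(
    "SELECT json_extract(data, '$.plan') FROM events "
    "WHERE json_extract(data, '$.type') = 'signup'"
).fetchone()
print(row[0])  # pro
```

In Postgres you would also add a GIN index on the JSON column to make these lookups fast.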
Scaling reads
Postgres is single-master for writes; reads scale via streaming replication. Each read replica can serve thousands of QPS, typically lagging the primary by 10-50 ms.
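The application-side half of this is read/write splitting. A hypothetical sketch (real deployments usually push this into a driver or a proxy such as PgBouncer): writes go to the primary, reads round-robin across replicas.

```python
import itertools

class Router:
    """Toy read/write splitter: primary for writes, replicas for reads.
    Connection names are illustrative placeholders, not a real driver API."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = itertools.cycle(replicas)

    def connection_for(self, sql):
        # Naive classification: anything that isn't a SELECT hits the
        # primary. Real routers also pin reads-after-writes to the primary
        # to hide replication lag.
        is_read = sql.lstrip().lower().startswith("select")
        return next(self._replicas) if is_read else self.primary

router = Router(primary="pg-primary", replicas=["replica-1", "replica-2"])
print(router.connection_for("SELECT * FROM users"))   # replica-1
print(router.connection_for("UPDATE users SET ..."))  # pg-primary
print(router.connection_for("SELECT 1"))              # replica-2
```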
Scaling writes
Single-master is the constraint. Options:
- Vertical scaling — bigger box. Postgres can handle huge workloads on serious hardware. Don’t underestimate this.
- Read offload — move heavy reads off the primary. Replicas, materialized views, OLAP DBs.
- Sharding — Citus, Vitess, or app-level. Significantly more complex; reach for it when you’ve genuinely outgrown a single Postgres.
- NewSQL — CockroachDB, Spanner, YugabyteDB. SQL with horizontal write scale, at operational cost.
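The core of app-level sharding is a stable key-to-shard mapping. A minimal sketch (the function name and shard count are illustrative):

```python
import hashlib

def shard_for(tenant_id: str, num_shards: int = 4) -> int:
    """Map a tenant to a shard with a stable hash. md5 keeps the mapping
    identical across processes and restarts; Python's built-in hash() is
    salted per process and would not be."""
    digest = hashlib.md5(tenant_id.encode()).hexdigest()
    return int(digest, 16) % num_shards

# Every service instance computes the same shard for the same tenant.
print(shard_for("tenant-42"))
```

The hard parts this sketch hides are exactly why sharding is a last resort: cross-shard queries, rebalancing when `num_shards` changes, and distributed transactions.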
Document Databases (MongoDB)
Designed for horizontal scaling from day one. Sharded clusters distribute writes across multiple primaries.
Wide-Column Stores (Cassandra, ScyllaDB)
Peer-to-peer architecture, no master. Optimized for massive write throughput and multi-DC.
The cost: limited query flexibility (must design table layout per query pattern), no joins, eventual consistency by default.
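"Design the table layout per query pattern" means denormalizing: the same record is written to multiple tables, each partitioned for one read path. A sketch using plain dicts to stand in for Cassandra partitions (table names are illustrative):

```python
from collections import defaultdict

# One "table" per query pattern, keyed by that query's partition key.
messages_by_user = defaultdict(list)     # partition key: user_id
messages_by_channel = defaultdict(list)  # partition key: channel_id

def post_message(user_id, channel_id, text):
    # Denormalized double-write: with no joins available, every read
    # path gets its own copy of the data at write time.
    messages_by_user[user_id].append(text)
    messages_by_channel[channel_id].append(text)

post_message("alice", "general", "hi all")
post_message("alice", "random", "anyone here?")
print(messages_by_user["alice"])       # ['hi all', 'anyone here?']
print(messages_by_channel["general"])  # ['hi all']
```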
Key-Value Stores
Redis, DynamoDB — simple get/set by key at very low latency and high throughput. The trade-off: little to no querying beyond the key. Ideal for sessions, caches, and rate limits.
Time-Series Databases
Specialized for high-volume time-stamped data — IoT sensors, monitoring metrics, financial ticks.
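The signature operation these databases optimize is bucketing raw samples into fixed time windows (what Timescale exposes as `time_bucket`). A minimal Python sketch, with made-up sample data:

```python
from statistics import mean

# (epoch_seconds, value) samples, e.g. from a sensor or metrics agent.
samples = [(0, 10.0), (15, 12.0), (70, 11.0), (95, 13.0)]

def downsample(samples, bucket_sec=60):
    """Group samples into fixed windows and average each window."""
    buckets = {}
    for ts, value in samples:
        start = ts // bucket_sec * bucket_sec  # window start time
        buckets.setdefault(start, []).append(value)
    return {start: mean(vals) for start, vals in sorted(buckets.items())}

print(downsample(samples))  # {0: 11.0, 60: 12.0}
```

A time-series database runs this kind of aggregation over billions of points by partitioning on time and compressing columns of adjacent values.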
Search Databases
For full-text search, complex filtering, log analysis.
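What makes these fast is the inverted index: a map from term to the documents containing it, so a query never scans every document. A toy sketch:

```python
from collections import defaultdict

docs = {1: "error connecting to database", 2: "database migration complete"}

# Inverted index: term -> set of document ids.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

print(sorted(index["database"]))  # [1, 2]
print(sorted(index["error"]))     # [1]
```

Real engines add tokenization, stemming, and relevance scoring (e.g. BM25) on top of this structure.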
Analytics / OLAP
OLTP databases optimize for many small transactions. OLAP databases optimize for few large analytical queries scanning massive amounts of data.
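The difference is largely physical layout. A sketch of the same (made-up) data stored row-wise versus column-wise: an aggregate like `SUM(amount)` touches one compact array in the columnar layout instead of every full row.

```python
# Row store (OLTP-style): each record kept together, good for
# "fetch/update order 2".
rows = [
    {"id": 1, "user": "alice", "amount": 30},
    {"id": 2, "user": "bob", "amount": 45},
    {"id": 3, "user": "alice", "amount": 25},
]

# Column store (OLAP-style): one array per column, good for scans.
columns = {key: [row[key] for row in rows] for key in rows[0]}

total_row_store = sum(row["amount"] for row in rows)  # reads whole rows
total_col_store = sum(columns["amount"])              # reads one column
print(total_row_store, total_col_store)  # 100 100
```

Columnar layouts also compress far better, since each array holds values of a single type.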
Graph Databases
For relationship-heavy data: knowledge graphs, fraud detection, recommendation engines.
- Neo4j — the dominant property-graph DB. Cypher query language.
- Amazon Neptune — managed graph DB on AWS. Property graph (Gremlin) and RDF (SPARQL).
- JanusGraph — distributed graph DB on top of Cassandra/HBase.
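The workload graph databases accelerate is multi-hop traversal. A sketch of "everyone within two hops" over an illustrative adjacency list; in Neo4j this collapses to a Cypher pattern along the lines of `(a)-[:KNOWS*..2]->(b)`:

```python
from collections import deque

# Toy social graph (illustrative data).
knows = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave", "erin"],
    "dave": [],
    "erin": [],
}

def within_hops(start, max_hops):
    """Breadth-first traversal bounded by hop count."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # don't expand past the hop limit
        for neighbor in knows.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen - {start}

print(sorted(within_hops("alice", 2)))  # ['bob', 'carol', 'dave', 'erin']
```

In a relational database the same question needs one self-join per hop, which is why relationship-heavy workloads outgrow SQL.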
NewSQL: SQL Without the Single-Master Limit
A category that emerged to combine SQL semantics and ACID transactions with horizontal write scale, typically via consensus replication (Raft in CockroachDB, Paxos in Spanner). You pay for it in operational complexity and per-query latency.
Choosing for Your Workload
A pragmatic decision tree for SaaS apps:
| Use case | Default choice |
|---|---|
| Tenant-scoped transactional data | Postgres (Aurora at scale) |
| User sessions, caching, rate limits | Redis |
| Search, full-text, log analysis | Elasticsearch / OpenSearch |
| Time-series metrics | Timescale or InfluxDB |
| Analytics warehouse | Snowflake / BigQuery / Redshift |
| Object storage | S3 |
| High-throughput append-only | Kafka + DynamoDB / Cassandra |
| Distributed multi-region OLTP | CockroachDB / Spanner |
Most production SaaS platforms run multiple databases, each tuned for what it’s best at. The “one database to rule them all” pattern doesn’t survive scale.
Recap
- Database scalability is about workload patterns first, data model second, scaling approach third.
- Relational (Postgres, MySQL) is the right default for most SaaS — modern features cover most NoSQL use cases.
- Document, wide-column, key-value, time-series, search, analytics, graph — each shines in specific use cases.
- NewSQL (Spanner, CockroachDB) gives you SQL with horizontal write scale, at operational cost.
- Real systems use multiple databases. The skill is matching access patterns to database strengths, not picking one for everything.