Scalability and Data Modeling in Databases

10 min read · Updated 2026-04-25

Database scalability isn’t just about handling more data — it’s about maintaining performance, reliability, and reasonable cost as the system grows. Real database choices involve fundamental trade-offs between consistency, performance, and complexity.

This lesson gives you a framework for evaluating database choices: workload patterns first, then data model, then scaling approach.

The Forces in Tension

Read vs. write patterns
Read-heavy (content, dashboards) vs. write-heavy (IoT, logs, events). Each has different scaling strategies.
Vertical vs. horizontal
Bigger boxes (simpler, hard limits) vs. more boxes (harder, scales further).
Replication vs. sharding
Replication scales reads. Sharding scales writes. Production typically combines both.
Consistency vs. performance
Strong consistency = synchronization overhead. Eventual consistency = faster, but stale views.
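
The consistency-vs-performance trade has a precise core: with N replicas, a write acknowledged by W nodes and a read consulting R nodes are guaranteed to overlap only when R + W > N. A minimal sketch of that rule (illustrative, not any particular database's implementation):

```python
# Sketch: the replica-quorum rule behind "consistency vs. performance".
# With N replicas, every read quorum of size R intersects every write
# quorum of size W only when R + W > N. Smaller R and W mean lower
# latency, but reads can miss the latest write.

def quorum_is_consistent(n: int, w: int, r: int) -> bool:
    """True if every read quorum intersects every write quorum."""
    return r + w > n

# N=3: writing to 2 nodes and reading from 2 always overlaps in one.
print(quorum_is_consistent(3, 2, 2))  # strongly consistent
# Writing to 1 and reading from 1 can miss the latest write entirely.
print(quorum_is_consistent(3, 1, 1))  # eventually consistent
```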

Relational Databases — Still the Default

Predictability is the strength. The schema is upfront and explicit. SQL is one of the most powerful query languages ever designed.

Modern Postgres and MySQL have absorbed much of what the "NoSQL killers" once offered — JSON documents, full-text search, geospatial, time-series. For most SaaS apps, start with Postgres.

Scaling reads

Postgres is single-master for writes; reads scale via streaming replication. Multiple read replicas, each serving thousands of QPS, typically run 10-50ms of replication lag behind the primary.

AWS Aurora
Postgres/MySQL with separated compute and storage. Up to 15 read replicas and 3-5x the throughput of vanilla Postgres. Aurora Serverless v2 handles variable load.

Scaling writes

Single-master is the constraint: every write funnels through one node. Options:

Citus / Vitess
Distribute tables across nodes. Citus for Postgres, Vitess for MySQL. Tens of thousands of writes/sec while keeping SQL.
Native partitioning
Postgres / MySQL native partitioning. Single DB, multiple physical tables. Cheap and well-supported; it eases large-table maintenance, though all writes still land on one server.
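
The core mechanism behind both sharding (Citus/Vitess-style) and hash partitioning is routing each row by a stable hash of a distribution key. A minimal, illustrative sketch (not the internals of any of these tools):

```python
# Sketch: hash-routing writes to shards by a distribution key.
# The shard count and key names here are made up for illustration.

import hashlib

NUM_SHARDS = 4

def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Stable hash routing: the same key always lands on the same shard."""
    digest = hashlib.sha256(key.encode()).hexdigest()
    return int(digest, 16) % num_shards

shards = {i: [] for i in range(NUM_SHARDS)}
for user_id in ("u1", "u2", "u3", "u4", "u5", "u6"):
    shards[shard_for(user_id)].append(user_id)

# Routing is deterministic, so reads for a key go straight to its shard.
print({s: len(rows) for s, rows in shards.items()})
```

Because the hash is stable, any node can compute a row's home shard without coordination — which is exactly what lets writes scale out.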

Document Databases (MongoDB)

Designed for horizontal scaling from day one. Sharded clusters distribute writes across multiple primaries.

Strengths
Flexible schemas and high write throughput, provided the shard key is well chosen (user ID, tenant ID, time-based). Good fits: social platforms, IoT ingestion, multi-tenant SaaS.
Costs
Shard key choice is critical — bad keys create hot spots. No joins, so data-shape decisions ripple through the codebase. Multi-document transactions exist but are more limited than in an RDBMS.
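
Why bad shard keys create hot spots can be shown in a few lines. With range-based sharding, a monotonically increasing key (like a timestamp) sends every new write to the same "last" shard, while a hashed key spreads load; the shard boundaries below are hypothetical, for illustration only:

```python
# Sketch: monotonic keys vs. hashed keys under range-based sharding.

import hashlib

def range_shard(key: int, boundaries=(250, 500, 750)) -> int:
    """Range sharding: shard i holds keys below boundaries[i]."""
    for i, b in enumerate(boundaries):
        if key < b:
            return i
    return len(boundaries)  # everything past the last boundary

def hash_shard(key: int, num_shards: int = 4) -> int:
    return int(hashlib.sha256(str(key).encode()).hexdigest(), 16) % num_shards

# 100 monotonically increasing keys, all past the last range boundary:
monotonic = range(1000, 1100)
range_hits = [range_shard(k) for k in monotonic]
hash_hits = [hash_shard(k) for k in monotonic]

print(set(range_hits))       # every write hammers one hot shard
print(len(set(hash_hits)))   # hashing spreads writes across shards
```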

Wide-Column Stores (Cassandra, ScyllaDB)

Peer-to-peer architecture, no master. Optimized for massive write throughput and multi-DC.

Append-optimized
Commit log + memtable + SSTable. Tens of thousands of writes per second per node. Predictable write latency.
Multi-DC native
Tunable replication strategies across data centers. Typically each cloud region maps to a Cassandra "DC" and each availability zone to a "rack."
Tunable consistency
Per-query consistency: ANY, ONE, QUORUM, ALL. Trade latency for stronger guarantees per operation.
Linear scalability
Add nodes, capacity grows linearly. Used by Netflix, Apple, Discord at massive scale.

The cost: limited query flexibility (must design table layout per query pattern), no joins, eventual consistency by default.
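
The append-optimized write path above (commit log, then memtable, then flush to immutable SSTables) can be sketched in miniature. This is heavily simplified from Cassandra's actual storage engine:

```python
# Sketch of an LSM-style write path: append to a log, buffer in a
# memtable, flush sorted runs to immutable "SSTables". Illustrative only.

class LSMStore:
    def __init__(self, memtable_limit: int = 3):
        self.commit_log = []           # sequential, append-only (recovery)
        self.memtable = {}             # in-memory map of recent writes
        self.sstables = []             # immutable sorted runs, newest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.commit_log.append((key, value))   # 1. append to log
        self.memtable[key] = value             # 2. update memtable
        if len(self.memtable) >= self.memtable_limit:
            # 3. flush: sort and write an immutable SSTable
            self.sstables.insert(0, dict(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:               # newest data first
            return self.memtable[key]
        for table in self.sstables:            # then newer SSTables
            if key in table:
                return table[key]
        return None

store = LSMStore()
for i in range(5):
    store.put(f"k{i}", i)
print(store.get("k0"), len(store.sstables))
```

Writes never seek: both the log append and the memtable update are cheap, which is where the predictable write latency comes from.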

Key-Value Stores

Redis
In-memory, multi-purpose
Sub-ms latency. Strings, hashes, lists, sets, streams, pub/sub. Used as a cache, session store, rate limiter, and for leaderboards. Roughly 100k+ ops/sec on a single node.
DynamoDB
Managed, infinite scale
AWS-managed key-value with optional document features. Single-digit ms latency at any scale. Pay-per-request or provisioned. Global tables for multi-region.
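
The rate-limiter use case mentioned for Redis is usually built on INCR plus EXPIRE. A fixed-window sketch of that pattern, backed by a plain dict so it runs standalone (the key name is made up):

```python
# Sketch: fixed-window rate limiting, the pattern commonly built on
# Redis INCR + EXPIRE, here with an in-memory dict for illustration.

import time

class FixedWindowLimiter:
    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window = window_seconds
        self.counters = {}  # (key, window_index) -> request count

    def allow(self, key, now=None) -> bool:
        now = time.time() if now is None else now
        bucket = (key, int(now // self.window))  # current time window
        count = self.counters.get(bucket, 0)
        if count >= self.limit:
            return False                         # over the limit
        self.counters[bucket] = count + 1
        return True

limiter = FixedWindowLimiter(limit=3, window_seconds=60)
results = [limiter.allow("user:42", now=100.0) for _ in range(5)]
print(results)  # first 3 allowed, rest rejected until the next window
```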

Time-Series Databases

Specialized for high-volume time-stamped data — IoT sensors, monitoring metrics, financial ticks.

InfluxDB
Purpose-built for time-series. Built-in retention policies, downsampling, time-windowed queries.
TimescaleDB
Postgres extension. Time-series performance + full SQL. Good for teams wanting Postgres ergonomics.
Amazon Timestream
AWS-managed time-series. Tiered storage (hot in memory, cold in S3-like).
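
The downsampling these systems do natively — collapsing raw points into fixed-width time buckets with one aggregate per bucket — looks roughly like this (a simplified sketch, not any product's API):

```python
# Sketch: downsample raw time-series points into time buckets,
# keeping the mean per bucket (what a retention policy might retain).

from collections import defaultdict

def downsample(points, bucket_seconds):
    """points: iterable of (unix_ts, value); returns {bucket_start: mean}."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % bucket_seconds].append(value)  # align to bucket
    return {start: sum(vs) / len(vs) for start, vs in sorted(buckets.items())}

raw = [(0, 10.0), (30, 20.0), (60, 30.0), (90, 50.0)]
print(downsample(raw, bucket_seconds=60))
# the 0-59s bucket averages to 15.0, the 60-119s bucket to 40.0
```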

Search Databases

For full-text search, complex filtering, log analysis.

Elasticsearch / OpenSearch
Inverted indexes for fast full-text search. Aggregations for analytics. The default for log analysis (ELK stack).
Algolia / Meilisearch
Hosted search-as-a-service. Tuned defaults; less flexibility than Elasticsearch but dramatically simpler operations.
Postgres full-text + pg_trgm
For modest search needs, Postgres's built-in full-text search plus the pg_trgm extension for fuzzy matching go a long way. Saves running a separate system.
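
The idea behind trigram matching (what pg_trgm indexes) is simple: two strings are similar when they share many 3-character substrings. A simplified sketch — pg_trgm's real normalization differs in details:

```python
# Sketch of trigram similarity, the idea behind Postgres's pg_trgm.
# Simplified: real pg_trgm normalizes and pads per word differently.

def trigrams(s: str) -> set:
    s = f"  {s.lower()} "  # pad so word boundaries produce trigrams
    return {s[i:i + 3] for i in range(len(s) - 2)}

def similarity(a: str, b: str) -> float:
    """Shared trigrams / total distinct trigrams (Jaccard-style)."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

# A typo still scores high; an unrelated word scores near zero.
print(similarity("postgres", "postgers") > similarity("postgres", "mysql"))
```

This is why trigram search tolerates typos that exact matching and plain LIKE queries miss.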

Analytics / OLAP

OLTP databases optimize for many small transactions. OLAP databases optimize for few large analytical queries scanning massive amounts of data.

Columnar storage
Snowflake, BigQuery, Redshift. Data stored column-by-column for compression and analytical query performance.
Lakehouse
Delta Lake, Iceberg, Hudi. Data lake (S3) + transactional metadata layer. Time travel, schema evolution.
ClickHouse
Open-source columnar DB. Extremely fast for analytical queries on huge datasets.
DuckDB
Embedded analytical DB (think SQLite for OLAP). Great for in-process analytics.
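
Why columnar layout wins for analytics can be shown with plain Python lists: an aggregate over one column touches only that column's array instead of walking every field of every row. A toy illustration (the data is made up):

```python
# Sketch: row-oriented vs. column-oriented layout for the same table.

# Row-oriented: one record per row (OLTP-friendly — easy point lookups).
rows = [
    {"order_id": 1, "region": "EU", "amount": 100.0},
    {"order_id": 2, "region": "US", "amount": 250.0},
    {"order_id": 3, "region": "EU", "amount": 75.0},
]

# Column-oriented: one array per column (OLAP-friendly — each array
# holds a single type, so it scans fast and compresses well).
columns = {
    "order_id": [1, 2, 3],
    "region": ["EU", "US", "EU"],
    "amount": [100.0, 250.0, 75.0],
}

# Same query, two layouts: total order amount.
row_total = sum(r["amount"] for r in rows)  # touches every whole row
col_total = sum(columns["amount"])          # touches one contiguous array
print(row_total == col_total, col_total)
```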

Graph Databases

For relationship-heavy data: knowledge graphs, fraud detection, recommendation engines.
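
What makes these workloads "graph-shaped" is that the query is a traversal: a friends-of-friends recommendation is a 2-hop walk. A minimal sketch over an adjacency list (the graph and names are made up):

```python
# Sketch: a 2-hop "friends of friends" traversal, the kind of query
# graph databases answer without join explosions.

graph = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "dave"},
    "carol": {"alice", "dave", "erin"},
    "dave": {"bob", "carol"},
    "erin": {"carol"},
}

def friends_of_friends(graph, start):
    """People exactly 2 hops away who are not already direct friends."""
    direct = graph[start]
    two_hops = set()
    for friend in direct:
        two_hops |= graph[friend]   # expand each friend's neighbors
    return two_hops - direct - {start}

print(sorted(friends_of_friends(graph, "alice")))
```

In a relational schema the same query is a self-join per hop; in a graph store each hop is a constant-time edge lookup, which is the whole appeal.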

NewSQL: SQL Without the Single-Master Limit

A category that emerged to combine SQL with horizontal scale.

Spanner
Google's globally distributed SQL. TrueTime API for tight clock sync. Strongly consistent across regions.
CockroachDB
Open-source globally distributed SQL with a Postgres-compatible API. Uses hybrid logical clocks (HLC) for ordering instead of specialized hardware clocks. Used by SaaS companies at scale.
YugabyteDB
Postgres-compatible distributed SQL. Multi-region active-active.
TiDB
MySQL-compatible distributed SQL. Strong fit for MySQL migrations.
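
The HLC that CockroachDB uses in place of Spanner's TrueTime combines physical time with a logical counter, so causally related events order correctly even when node clocks drift. A greatly simplified sketch of the idea:

```python
# Sketch of a hybrid logical clock (HLC). Greatly simplified from
# real implementations; physical time is passed in as an integer.

class HLC:
    def __init__(self):
        self.wall = 0     # highest physical time seen so far
        self.logical = 0  # tie-breaker within one wall tick

    def now(self, physical_time: int) -> tuple:
        """Timestamp a local event: advance past physical and last HLC."""
        if physical_time > self.wall:
            self.wall, self.logical = physical_time, 0
        else:
            self.logical += 1
        return (self.wall, self.logical)

    def update(self, physical_time: int, remote: tuple) -> tuple:
        """Receive a message: never order before the sender's timestamp."""
        self.wall = max(self.wall, physical_time, remote[0])
        if self.wall == remote[0]:
            self.logical = max(self.logical, remote[1]) + 1
        elif self.wall == physical_time:
            self.logical = 0
        else:
            self.logical += 1
        return (self.wall, self.logical)

clock = HLC()
t1 = clock.now(100)
# A message arrives stamped (120, 5) while our clock still reads 100:
t2 = clock.update(100, (120, 5))
print(t1 < t2)  # causal order preserved despite clock skew
```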

Choosing for Your Workload

A pragmatic decision tree for SaaS apps:

Use case | Default choice
Tenant-scoped transactional data | Postgres (Aurora at scale)
User sessions, caching, rate limits | Redis
Search, full-text, log analysis | Elasticsearch / OpenSearch
Time-series metrics | TimescaleDB or InfluxDB
Analytics warehouse | Snowflake / BigQuery / Redshift
Object storage | S3
High-throughput append-only | Kafka + DynamoDB / Cassandra
Distributed multi-region OLTP | CockroachDB / Spanner

Most production SaaS platforms run multiple databases, each tuned for what it’s best at. The “one database to rule them all” pattern doesn’t survive scale.

Recap