Redundancy is the foundation of fault tolerance. Everything else in this section (rate limiting, circuit breakers, bulkheads) assumes you already have multiple copies of the things that can fail. Without redundancy, the rest is window dressing.
Levels of Redundancy
Hardware
Multiple servers, RAID disks, redundant power supplies. The original "redundancy": keeps the bare metal running.
Network
Multiple ISPs, multiple paths through the cloud. Losing one link or peering connection doesn't take you down.
Availability Zone
Run across 2-3 AZs in a region. Loss of one AZ doesn't take down the service.
Region
Multi-region deployment. Loss of an entire AWS/GCP/Azure region is rare but real.
Provider
Multi-cloud. Survives the loss of an entire cloud provider. Expensive and complex; reserved for the most critical systems.
Service instances
Multiple replicas of each service. Pod replicas in K8s, ASG instances. The everyday level.
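To make the availability-zone and service-instance levels concrete, here is a minimal sketch in plain Python (the helpers place_replicas and survives_zone_loss are hypothetical, not from any cloud SDK) that spreads replicas across zones and checks whether losing any single zone still leaves enough capacity for normal load:

```python
# Spread replicas across availability zones and check that losing
# any single zone still leaves enough capacity to serve normal load.
# (Hypothetical helpers for illustration only.)
from collections import Counter
from itertools import cycle

def place_replicas(num_replicas: int, zones: list[str]) -> list[str]:
    """Assign replicas to zones round-robin, like a topology spread constraint."""
    zone_cycle = cycle(zones)
    return [next(zone_cycle) for _ in range(num_replicas)]

def survives_zone_loss(placement: list[str], needed_for_load: int) -> bool:
    """True if losing the most heavily populated zone still leaves enough replicas."""
    per_zone = Counter(placement)
    worst_case_loss = max(per_zone.values())
    return len(placement) - worst_case_loss >= needed_for_load

placement = place_replicas(6, ["us-east-1a", "us-east-1b", "us-east-1c"])
print(placement)                          # 2 replicas per AZ
print(survives_zone_loss(placement, 4))   # True: 6 - 2 = 4 remain
```

This is roughly what a Kubernetes topology spread constraint or a multi-AZ Auto Scaling group does for you automatically; the point is that zone-level redundancy is a capacity question, not just a count of replicas.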
Active-Active vs. Active-Passive
Active-Active
All instances serve traffic
The load balancer spreads traffic across all replicas. When one instance fails, it drops out of rotation and the others keep serving. Most efficient: no idle capacity.
Active-Passive
Standby takes over on failure
Primary serves traffic; the standby sits idle until failover. Simpler consistency story for stateful systems, but the standby is wasted capacity, and failover is fast only if the standby is kept hot.
Most stateless services run active-active. Stateful systems (primary databases, leader-elected services) often run active-passive with automated failover.
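As a sketch of the active-active pattern, the toy load balancer below (hypothetical classes, not a real proxy) round-robins across replicas and simply skips any replica whose health check has failed, which is why losing one instance only shrinks the pool:

```python
# Toy active-active load balancer: round-robin over replicas, skipping
# any replica marked unhealthy. One failure just removes it from rotation.
from itertools import cycle

class Backend:
    def __init__(self, name: str):
        self.name = name
        self.healthy = True   # flipped by a health checker in a real system

class ActiveActiveBalancer:
    def __init__(self, backends: list[Backend]):
        self._rotation = cycle(backends)
        self._size = len(backends)

    def pick(self) -> Backend:
        # Try each backend at most once per call; raise if all are down.
        for _ in range(self._size):
            backend = next(self._rotation)
            if backend.healthy:
                return backend
        raise RuntimeError("no healthy backends")

backends = [Backend("app-1"), Backend("app-2"), Backend("app-3")]
lb = ActiveActiveBalancer(backends)
backends[1].healthy = False               # app-2 fails...
print([lb.pick().name for _ in range(4)]) # ...app-1 and app-3 keep serving
```

An active-passive setup replaces the rotation with a single primary plus a promotion step; that step is essentially the failover sketch shown later in the data-tier section.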
N+1, N+2, 2N
How much extra capacity?
N+1
Enough capacity for normal load (N) plus one extra. Survives one failure without degradation. The standard for most services.
N+2
Two extra. Survives two simultaneous failures, or one failure during a planned upgrade. Used for critical services.
2N
Double capacity. Active-active across two equal sites. Used when you need full capacity even after losing half the infrastructure.
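The capacity math is simple enough to write down. A minimal sketch, assuming each instance handles a fixed request rate (the numbers and the instances_needed helper below are made up for illustration):

```python
# Capacity math for N+1, N+2, and 2N: compute N from peak load and
# per-instance capacity, then add the policy's headroom.
import math

def instances_needed(peak_rps: float, per_instance_capacity: float, policy: str) -> int:
    n = math.ceil(peak_rps / per_instance_capacity)   # N: bare minimum for the load
    if policy == "N+1":
        return n + 1
    if policy == "N+2":
        return n + 2
    if policy == "2N":
        return 2 * n
    raise ValueError(f"unknown policy: {policy}")

# Example: 9,000 req/s peak, 2,000 req/s per instance -> N = 5
for policy in ("N+1", "N+2", "2N"):
    print(policy, instances_needed(9_000, 2_000, policy))
# N+1 -> 6, N+2 -> 7, 2N -> 10
```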
Redundancy in the Data Tier
This is where redundancy gets interesting:
Replication
Multiple copies of the same data. Synchronous replication gives slower writes but no loss of acknowledged writes; asynchronous gives fast writes but can lose recent writes on failover.
Replica failover
When the primary fails, a follower is promoted, either manually or automatically via a consensus protocol such as Raft or Paxos (see the sketch below).
Backups
Backups are not the same as replicas. Replicas fail in correlated ways: a bug or an operator error replicates to every copy instantly. Always keep point-in-time backups.
Multi-region replication
Cross-region replicas for disaster recovery. Typically eventually consistent across regions; weigh the benefit against cost and complexity.
We covered replication patterns in Sharding and Replication. The high-level takeaway: replicas plus backups, not just replicas.
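As a rough illustration of the replica failover mentioned above: when the primary's health check fails, promote the healthy follower with the least replication lag. Real systems do this through Raft/Paxos or an external coordinator; the classes and fields below are hypothetical:

```python
# Simplified failover sketch (not a real consensus implementation):
# pick the healthy follower that is closest to caught up, to minimize
# data loss on promotion.
from dataclasses import dataclass

@dataclass
class Replica:
    name: str
    replication_lag_bytes: int   # how far behind the primary this follower is
    healthy: bool = True

def elect_new_primary(followers: list[Replica]) -> Replica:
    """Promote the healthy follower with the least replication lag."""
    candidates = [r for r in followers if r.healthy]
    if not candidates:
        raise RuntimeError("no healthy follower to promote")
    return min(candidates, key=lambda r: r.replication_lag_bytes)

followers = [
    Replica("db-follower-1", replication_lag_bytes=4096),
    Replica("db-follower-2", replication_lag_bytes=0),
    Replica("db-follower-3", replication_lag_bytes=1024, healthy=False),
]
print(elect_new_primary(followers).name)   # db-follower-2: healthy and fully caught up
```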
Stateless vs. Stateful
The complexity of redundancy depends on state.
Stateless services
Easy to make redundant
Just run multiple replicas behind a load balancer. Replicas are interchangeable, so you can add or remove them without coordination. This covers most application services.
Stateful services
Hard to make redundant
Each replica has unique state. Failover requires coordination: leader election, state transfer, consistency guarantees. Databases, brokers, leader-elected coordinators.
This is why "make it stateless and externalize state" is a recurring architectural recommendation: keep state in databases or caches, and keep the services themselves stateless.
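A minimal sketch of what "externalize state" means in practice: the handler below keeps nothing in process memory, so any replica can serve any request. SessionStore wraps a plain dict and stands in for an external store such as Redis or a database; all names here are hypothetical:

```python
# Stateless handler with externalized session state: read from the
# shared store, mutate, write back. No instance-local state, so
# replicas stay interchangeable behind the load balancer.
class SessionStore:
    """Stand-in for an external store shared by all replicas."""
    def __init__(self):
        self._data: dict[str, dict] = {}

    def get(self, session_id: str) -> dict:
        return self._data.setdefault(session_id, {})

    def put(self, session_id: str, session: dict) -> None:
        self._data[session_id] = session

def handle_request(store: SessionStore, session_id: str, item: str) -> dict:
    # Read-modify-write against the shared store; the request could have
    # landed on any replica and the result would be the same.
    session = store.get(session_id)
    session.setdefault("cart", []).append(item)
    store.put(session_id, session)
    return session

shared_store = SessionStore()
handle_request(shared_store, "user-42", "book")        # served by replica A
print(handle_request(shared_store, "user-42", "pen"))  # replica B sees the same cart
```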
Failure Detection
Redundancy without detection is meaningless. You need to know when something has failed: