Redundancy

7 min read · Updated 2026-04-25

Redundancy is the foundation of fault tolerance. Everything else in this section (rate limiting, circuit breakers, bulkheads) assumes you already have multiple copies of the things that can fail. Without redundancy, the rest is window dressing.

Levels of Redundancy

Hardware
Multiple servers, RAID disks, redundant power supplies. The original "redundancy": it keeps the bare metal running.
Network
Multiple ISPs, multiple paths through the cloud. Loss of one link or peering doesn't take you down.
Availability Zone
Run across 2-3 AZs in a region. Loss of one AZ doesn't take down the service.
Region
Multi-region deployment. Loss of an entire AWS/GCP/Azure region is rare but real.
Provider
Multi-cloud. Loss of an entire cloud provider. Expensive, complex; reserved for the most critical systems.
Service instances
Multiple replicas of each service. Pod replicas in K8s, ASG instances. The everyday level.

Active-Active vs. Active-Passive

Active-Active
All instances serve traffic
Load balancer spreads traffic across all replicas. Failure = one instance drops out, others keep serving. Most efficient (no idle capacity).
Active-Passive
Standby takes over on failure
Primary serves traffic while a passive standby sits idle until failover. Simpler consistency story for stateful systems; the cost is idle capacity, and slow failover unless the standby is hot.

Most stateless services run active-active. Stateful systems (primary databases, leader-elected services) often run active-passive with automated failover.
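
A minimal sketch of the active-passive failover loop, with invented hostnames, port, and thresholds; promote() is a stand-in for whatever promotion mechanism your system actually has (pg_ctl promote, a DNS flip, a VIP move):

  import socket
  import time

  PRIMARY = "db-primary.internal"   # hypothetical hostnames
  STANDBY = "db-standby.internal"
  FAILURE_THRESHOLD = 3             # consecutive failed probes before failover
  PROBE_INTERVAL = 2.0              # seconds between probes

  def check_health(host: str, port: int = 5432, timeout: float = 1.0) -> bool:
      # TCP connect probe; real setups usually hit an app-level health endpoint.
      try:
          with socket.create_connection((host, port), timeout=timeout):
              return True
      except OSError:
          return False

  def promote(host: str) -> None:
      # Stand-in for the real promotion step.
      print(f"promoting {host} to primary")

  def watchdog() -> None:
      failures = 0
      while True:
          if check_health(PRIMARY):
              failures = 0              # one healthy probe resets the count
          else:
              failures += 1             # demand consecutive failures so a
          if failures >= FAILURE_THRESHOLD:  # single blip doesn't trigger failover
              promote(STANDBY)
              return
          time.sleep(PROBE_INTERVAL)

The threshold is the interesting knob: too low and a transient network blip triggers a risky failover; too high and the outage stretches while the watchdog waits.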

N+1, N+2, 2N

How much extra capacity?

N+1
Enough capacity for normal load (N) plus one extra. Survives one failure without degradation. The standard for most services.
N+2
Two extra. Survives two simultaneous failures, or one failure during a planned upgrade. Used for critical services.
2N
Double capacity. Active-active across two equal sites. Used when you need full capacity even after losing half the infrastructure.
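
The capacity math is mechanical enough to sketch; the 9,000 rps peak load and 1,000 rps per-instance capacity below are made-up numbers:

  import math

  def replicas_needed(peak_rps: float, per_instance_rps: float, model: str) -> int:
      # N = instances required for normal peak load, before any redundancy.
      n = math.ceil(peak_rps / per_instance_rps)
      return {"N+1": n + 1, "N+2": n + 2, "2N": 2 * n}[model]

  # 9,000 rps peak at 1,000 rps per instance -> N = 9
  for model in ("N+1", "N+2", "2N"):
      print(model, replicas_needed(9_000, 1_000, model))   # 10, 11, 18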

Redundancy in the Data Tier

This is where redundancy gets interesting:

Replication
Multiple copies of the same data. Sync (slow writes, no loss) or async (fast writes, possible loss).
Replica failover
When the primary fails, a follower is promoted. Manual or automatic via Raft/Paxos.
Backups
Backups are not the same as replicas. Replicas fail in correlated ways: a bug or an operator error replicates instantly to every copy. Always have point-in-time backups.
Multi-region replication
Cross-region replicas for DR. Eventually consistent across regions; balance against cost and complexity.

We covered replication patterns in Sharding and Replication. The high-level rule: replicas plus backups, not just replicas.
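
The sync/async trade-off fits in a toy sketch; Replica here is an invented in-process stand-in for a follower that would really sit across a network:

  from concurrent.futures import ThreadPoolExecutor

  class Replica:
      def __init__(self) -> None:
          self.log: list[str] = []

      def append(self, entry: str) -> None:
          self.log.append(entry)        # simulates applying a replicated write

  def write_sync(primary: Replica, followers: list[Replica], entry: str) -> None:
      # Slow writes, no loss: acknowledge only once every copy has the entry.
      primary.append(entry)
      for f in followers:
          f.append(entry)

  def write_async(primary: Replica, followers: list[Replica],
                  entry: str, pool: ThreadPoolExecutor) -> None:
      # Fast writes, possible loss: acknowledge after the primary alone;
      # a crash before the background sends finish loses the entry.
      primary.append(entry)
      for f in followers:
          pool.submit(f.append, entry)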

Stateless vs. Stateful

The complexity of redundancy depends on state.

Stateless services
Easy to make redundant
Just run multiple replicas behind a load balancer. Replicas are interchangeable. Add or remove without coordination. Most application services.
Stateful services
Hard to make redundant
Each replica has unique state. Failover requires coordination: leader election, state transfer, consistency guarantees. Databases, brokers, leader-elected coordinators.

This is why "make it stateless and externalize state" is a recurring architectural recommendation. Keep state in databases or caches; keep the services themselves stateless.
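
A sketch of what that recommendation looks like in application code; the dict-backed SessionStore is a stand-in for an external store such as Redis:

  class SessionStore:
      # Stand-in for an external store (Redis, a database) shared by replicas.
      def __init__(self) -> None:
          self._data: dict[str, list[str]] = {}

      def get(self, key: str) -> list[str] | None:
          return self._data.get(key)

      def put(self, key: str, value: list[str]) -> None:
          self._data[key] = value

  store = SessionStore()

  def handle_add_to_cart(session_id: str, item: str) -> list[str]:
      # Stateless handler: read state in, write state out, keep nothing local,
      # so any replica can serve any request and replicas stay interchangeable.
      cart = store.get(session_id) or []
      cart.append(item)
      store.put(session_id, cart)
      return cart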

Failure Detection

Redundancy without detection is meaningless. You need to know when something has failed:

Health checks
Load balancer probes each instance. Failed probes → instance removed from rotation. Fast detection (seconds).
Heartbeats
Inter-service liveness signals. Used by leader-elected services and clusters. Tunable detection windows.
External monitoring
Synthetic monitoring from outside (Pingdom, Datadog Synthetic). Catches failures the load balancer can't see.
Metrics-based alerting
Sustained error rate, latency spike, queue backup. Symptoms that something is wrong even when health checks say "OK."
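
Heartbeat-based detection reduces to a timestamp table plus a window; the 10-second window here is an invented default:

  import time

  DETECTION_WINDOW = 10.0   # seconds of silence before an instance is suspect

  last_seen: dict[str, float] = {}

  def record_heartbeat(instance_id: str) -> None:
      last_seen[instance_id] = time.monotonic()

  def suspected_dead() -> list[str]:
      now = time.monotonic()
      return [i for i, t in last_seen.items() if now - t > DETECTION_WINDOW]

The window is the tunable mentioned above: shorter means faster detection but more false positives from GC pauses and routine network blips.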

Common Anti-patterns

Single points of failure (SPOFs)
A single load balancer, a single DNS provider, a single auth service. Find and eliminate them; they negate everything else.
Mirror SPOFs
Both replicas in the same AZ. Both regions in the same provider. Both providers in the same continent. Redundancy that's correlated isn't redundancy.
Untested failover
A failover that's never been exercised will fail when needed. Run game days. Practice the chaos.
Dependency without redundancy
Your service is HA, but it depends on a single external API. Your availability is bounded by their availability, usually below your SLA.
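
The "bounded by their availability" point is just multiplication: availabilities of hard serial dependencies compound, as in this sketch with made-up numbers:

  def serial_availability(*deps: float) -> float:
      # A chain of hard dependencies is up only when all of them are up.
      result = 1.0
      for a in deps:
          result *= a
      return result

  # A 99.95% service calling one non-redundant 99.9% API:
  print(serial_availability(0.9995, 0.999))   # 0.9985..., i.e. ~99.85%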

Practical Redundancy for SaaS

A typical setup:

Region: us-east-1

  ┌─── AZ 1a ────┐  ┌─── AZ 1b ────┐  ┌─── AZ 1c ────┐
  │ App pods × N │  │ App pods × N │  │ App pods × N │
  │ DB primary   │  │ DB replica   │  │ DB replica   │
  │ Cache        │  │ Cache        │  │ Cache        │
  └──────────────┘  └──────────────┘  └──────────────┘
         ↑                 ↑                 ↑
         └─────────────────┴─────────────────┘
                 ALB across all 3 AZs

Region: us-west-2 (DR)
  Async replication of DB; on-demand stand-up of app tier
  Backups in S3 with cross-region replication

This survives:

- Loss of any single pod or instance (the ALB routes around it)
- Loss of an entire AZ (app capacity and a DB replica remain in the other two)
- Failure of the DB primary (a replica in another AZ is promoted)
- Loss of the whole region (fail over to us-west-2, accepting async replication lag)
- Data corruption or operator error (restore from the S3 backups)

What it doesn't survive cheaply: cloud-provider-wide outages. For most SaaS, that's an acceptable risk.

Recap

Redundancy runs from hardware up to whole providers; choose the levels your availability target actually needs. Stateless services go active-active with N+1 capacity; stateful systems need replicas, automated failover, and point-in-time backups. Pair redundancy with failure detection, hunt down correlated SPOFs, and exercise failover before an outage does it for you.