Redundancy

7 min read · Updated 2026-04-25

Redundancy is the foundation of fault tolerance. Everything else in this section (rate limiting, circuit breakers, bulkheads) assumes you already have multiple copies of the things that can fail. Without redundancy, the rest is window dressing.

Levels of Redundancy

Hardware
Multiple servers, RAID disks, redundant power supplies. The original "redundancy": it keeps the bare metal running.
Network
Multiple ISPs, multiple paths through the cloud. Loss of one link or peering doesn't take you down.
Availability Zone
Run across 2-3 AZs in a region. Loss of one AZ doesn't take down the service.
Region
Multi-region deployment. Loss of an entire AWS/GCP/Azure region is rare but real.
Provider
Multi-cloud. Loss of an entire cloud provider. Expensive, complex; reserved for the most critical systems.
Service instances
Multiple replicas of each service. Pod replicas in K8s, ASG instances. The everyday level.

Active-Active vs. Active-Passive

Active-Active
All instances serve traffic
Load balancer spreads traffic across all replicas. Failure = one instance drops out, others keep serving. Most efficient (no idle capacity).
Active-Passive
Standby takes over on failure
Primary serves traffic while a passive standby sits idle until failover. Simpler consistency story for stateful systems; the cost is idle capacity, and slow failover unless the standby is hot.

Most stateless services run active-active. Stateful systems (primary databases, leader-elected services) often run active-passive with automated failover.
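
A minimal sketch of the active-passive failover loop, with invented hostnames, port, and thresholds; promote() is a stand-in for whatever promotion mechanism your system actually has (pg_ctl promote, a DNS flip, a VIP move):

  import socket
  import time

  PRIMARY = "db-primary.internal"   # hypothetical hostnames
  STANDBY = "db-standby.internal"
  FAILURE_THRESHOLD = 3             # consecutive failed probes before failover
  PROBE_INTERVAL = 2.0              # seconds between probes

  def check_health(host: str, port: int = 5432, timeout: float = 1.0) -> bool:
      # TCP connect probe; real setups usually hit an app-level health endpoint.
      try:
          with socket.create_connection((host, port), timeout=timeout):
              return True
      except OSError:
          return False

  def promote(host: str) -> None:
      # Stand-in for the real promotion step.
      print(f"promoting {host} to primary")

  def watchdog() -> None:
      failures = 0
      while True:
          if check_health(PRIMARY):
              failures = 0              # one healthy probe resets the count
          else:
              failures += 1             # demand consecutive failures so a
          if failures >= FAILURE_THRESHOLD:  # single blip doesn't trigger failover
              promote(STANDBY)
              return
          time.sleep(PROBE_INTERVAL)

The threshold is the interesting knob: too low and a transient network blip triggers a risky failover; too high and the outage stretches while the watchdog waits.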

N+1, N+2, 2N

How much extra capacity?

N+1
Enough capacity for normal load (N) plus one extra. Survives one failure without degradation. The standard for most services.
N+2
Two extra. Survives two simultaneous failures, or one failure during a planned upgrade. Used for critical services.
2N
Double capacity. Active-active across two equal sites. Used when you need full capacity even after losing half the infrastructure.
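
The capacity math is mechanical enough to sketch; the 9,000 rps peak load and 1,000 rps per-instance capacity below are made-up numbers:

  import math

  def replicas_needed(peak_rps: float, per_instance_rps: float, model: str) -> int:
      # N = instances required for normal peak load, before any redundancy.
      n = math.ceil(peak_rps / per_instance_rps)
      return {"N+1": n + 1, "N+2": n + 2, "2N": 2 * n}[model]

  # 9,000 rps peak at 1,000 rps per instance -> N = 9
  for model in ("N+1", "N+2", "2N"):
      print(model, replicas_needed(9_000, 1_000, model))   # 10, 11, 18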

Redundancy in the Data Tier

This is where redundancy gets interesting:

Replication
Multiple copies of the same data. Sync (slow writes, no loss) or async (fast writes, possible loss).
Replica failover
When the primary fails, a follower is promoted. Manual or automatic via Raft/Paxos.
Backups
Backups are not the same as replicas. Replicas fail in correlated ways: a bug or an operator error replicates instantly to every copy. Always have point-in-time backups.
Multi-region replication
Cross-region replicas for DR. Eventually consistent across regions; balance against cost and complexity.

We covered replication patterns in Sharding and Replication. The high-level rule: replicas plus backups, not just replicas.
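
The sync/async trade-off fits in a toy sketch; Replica here is an invented in-process stand-in for a follower that would really sit across a network:

  from concurrent.futures import ThreadPoolExecutor

  class Replica:
      def __init__(self) -> None:
          self.log: list[str] = []

      def append(self, entry: str) -> None:
          self.log.append(entry)        # simulates applying a replicated write

  def write_sync(primary: Replica, followers: list[Replica], entry: str) -> None:
      # Slow writes, no loss: acknowledge only once every copy has the entry.
      primary.append(entry)
      for f in followers:
          f.append(entry)

  def write_async(primary: Replica, followers: list[Replica],
                  entry: str, pool: ThreadPoolExecutor) -> None:
      # Fast writes, possible loss: acknowledge after the primary alone;
      # a crash before the background sends finish loses the entry.
      primary.append(entry)
      for f in followers:
          pool.submit(f.append, entry)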

Stateless vs. Stateful

The complexity of redundancy depends on state.

Stateless services
Easy to make redundant
Just run multiple replicas behind a load balancer. Replicas are interchangeable. Add or remove without coordination. Most application services.
Stateful services
Hard to make redundant
Each replica has unique state. Failover requires coordination: leader election, state transfer, consistency guarantees. Databases, brokers, leader-elected coordinators.

This is why "make it stateless and externalize state" is a recurring architectural recommendation. Keep state in databases or caches; keep the services themselves stateless.
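
A sketch of what that recommendation looks like in application code; the dict-backed SessionStore is a stand-in for an external store such as Redis:

  class SessionStore:
      # Stand-in for an external store (Redis, a database) shared by replicas.
      def __init__(self) -> None:
          self._data: dict[str, list[str]] = {}

      def get(self, key: str) -> list[str] | None:
          return self._data.get(key)

      def put(self, key: str, value: list[str]) -> None:
          self._data[key] = value

  store = SessionStore()

  def handle_add_to_cart(session_id: str, item: str) -> list[str]:
      # Stateless handler: read state in, write state out, keep nothing local,
      # so any replica can serve any request and replicas stay interchangeable.
      cart = store.get(session_id) or []
      cart.append(item)
      store.put(session_id, cart)
      return cart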

Failure Detection

Redundancy without detection is meaningless. You need to know when something has failed:

Health checks
Load balancer probes each instance. Failed probes → instance removed from rotation. Fast detection (seconds).
Heartbeats
Inter-service liveness signals. Used by leader-elected services and clusters. Tunable detection windows.
External monitoring
Synthetic monitoring from outside (Pingdom, Datadog Synthetic). Catches failures the load balancer can't see.
Metrics-based alerting
Sustained error rate, latency spike, queue backup. Symptoms that something is wrong even when health checks say "OK."
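
Heartbeat-based detection reduces to a timestamp table plus a window; the 10-second window here is an invented default:

  import time

  DETECTION_WINDOW = 10.0   # seconds of silence before an instance is suspect

  last_seen: dict[str, float] = {}

  def record_heartbeat(instance_id: str) -> None:
      last_seen[instance_id] = time.monotonic()

  def suspected_dead() -> list[str]:
      now = time.monotonic()
      return [i for i, t in last_seen.items() if now - t > DETECTION_WINDOW]

The window is the tunable mentioned above: shorter means faster detection but more false positives from GC pauses and routine network blips.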

Common Anti-patterns

Single points of failure (SPOFs)
A single load balancer, a single DNS provider, a single auth service. Find and eliminate them; they negate everything else.
Mirror SPOFs
Both replicas in the same AZ. Both regions in the same provider. Both providers in the same continent. Redundancy that's correlated isn't redundancy.
Untested failover
A failover that's never been exercised will fail when needed. Run game days. Practice the chaos.
Dependency without redundancy
Your service is HA, but it depends on a single external API. Your availability is bounded by their availability, usually below your SLA.
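
The "bounded by their availability" point is just multiplication: availabilities of hard serial dependencies compound, as in this sketch with made-up numbers:

  def serial_availability(*deps: float) -> float:
      # A chain of hard dependencies is up only when all of them are up.
      result = 1.0
      for a in deps:
          result *= a
      return result

  # A 99.95% service calling one non-redundant 99.9% API:
  print(serial_availability(0.9995, 0.999))   # 0.9985..., i.e. ~99.85%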

Practical Redundancy for SaaS

A typical setup:

Region: us-east-1

  ┌─── AZ 1a ────┐  ┌─── AZ 1b ────┐  ┌─── AZ 1c ────┐
  │ App pods × N │  │ App pods × N │  │ App pods × N │
  │ DB primary   │  │ DB replica   │  │ DB replica   │
  │ Cache        │  │ Cache        │  │ Cache        │
  └──────────────┘  └──────────────┘  └──────────────┘
         ↑                 ↑                 ↑
         └─────────────────┴─────────────────┘
                 ALB across all 3 AZs

Region: us-west-2 (DR)
  Async replication of DB; on-demand stand-up of app tier
  Backups in S3 with cross-region replication

This survives:

- Loss of any single pod or instance (the ALB routes around it)
- Loss of an entire AZ (app capacity and a DB replica remain in the other two)
- Failure of the DB primary (a replica in another AZ is promoted)
- Loss of the whole region (fail over to us-west-2, accepting async replication lag)
- Data corruption or operator error (restore from the S3 backups)

What it doesn't survive cheaply: cloud-provider-wide outages. For most SaaS, that's an acceptable risk.

Recap

Redundancy runs from hardware up to whole providers; choose the levels your availability target actually needs. Stateless services go active-active with N+1 capacity; stateful systems need replicas, automated failover, and point-in-time backups. Pair redundancy with failure detection, hunt down correlated SPOFs, and exercise failover before an outage does it for you.