Circuit Breakers

7 min read · Updated 2026-04-25

A circuit breaker is the resilience pattern that stops your service from making things worse. When a downstream service is failing, circuit breakers fail fast instead of piling up requests that will time out, protecting both the downstream from overload and your own service from resource exhaustion.

The pattern is named after electrical circuit breakers, and the analogy is apt: a switch that trips when overloaded, and can be reset once the fault clears (unlike a fuse, which blows once).

The Three States

Closed (normal)
Requests flow through. Track success/failure rate.
Open (broken)
Failure threshold exceeded. Reject requests immediately. Don't hit the downstream.
Half-Open (probing)
After cooldown, allow a few test requests through. If they succeed, close. If they fail, back to open.
                β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                β”‚     Closed     β”‚
                β”‚  (normal flow) β”‚
                β””β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                        β”‚ failure threshold exceeded
                        β–Ό
                β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                β”‚      Open      β”‚
   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”‚ (reject fast)  β”‚
   β”‚            β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜
   β”‚                     β”‚ cooldown elapsed
   β”‚                     β–Ό
   β”‚            β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
   β”‚            β”‚   Half-Open    β”‚
   β”‚            β”‚ (probe slowly) β”‚
   β”‚            β””β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”˜
   β”‚       success β”‚         β”‚ failure
   β”‚     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜         └────────┐
   β”‚     β”‚                            β”‚
   β–Ό     β–Ό                            β–Ό
   to "Closed"            (back to "Open")
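The transitions in the diagram can be sketched as a small state machine. This is a minimal, single-threaded sketch; the consecutive-failure threshold and the method names (`allow_request`, `record_success`, `record_failure`) are illustrative choices, not a standard API:

```python
import enum
import time


class State(enum.Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class BreakerStateMachine:
    """Tracks only the state transitions from the diagram above."""

    def __init__(self, failure_threshold=5, cooldown_seconds=30.0):
        self.failure_threshold = failure_threshold  # consecutive failures that trip
        self.cooldown_seconds = cooldown_seconds    # how long to stay open
        self.state = State.CLOSED
        self.failures = 0
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        if self.state is State.OPEN:
            # Cooldown elapsed: move to half-open and let a probe through.
            if time.monotonic() - self.opened_at >= self.cooldown_seconds:
                self.state = State.HALF_OPEN
                return True
            return False  # reject fast
        return True  # closed or half-open: request may proceed

    def record_success(self) -> None:
        # Success in half-open (or closed) resets the breaker to closed.
        self.state = State.CLOSED
        self.failures = 0

    def record_failure(self) -> None:
        if self.state is State.HALF_OPEN:
            self._trip()  # probe failed: straight back to open
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self._trip()

    def _trip(self) -> None:
        self.state = State.OPEN
        self.opened_at = time.monotonic()
```

A production breaker also needs thread safety and a rolling window instead of a bare consecutive-failure counter, but the three states and their transitions are exactly this shape.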

Configuration

A typical circuit breaker has these knobs:

Failure threshold
How many failures (or what percentage) trip the breaker? Common: 50% failure rate over 100 requests, or 5 consecutive failures.
Cooldown / timeout
How long does the breaker stay open before probing? Common: 30-60 seconds.
Probe count
How many requests allowed through in half-open state before deciding? Common: 1-10.
Window size
Time or count window for the failure threshold. Common: rolling 10-second or 100-request window.
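Grouping the knobs into one config object makes the trade-offs explicit. A sketch with illustrative field names and the common defaults from above:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BreakerConfig:
    failure_rate_threshold: float = 0.5  # trip at 50% failures...
    min_requests: int = 100              # ...measured over at least this many calls
    cooldown_seconds: float = 30.0       # how long to stay open before probing
    probe_count: int = 5                 # requests allowed through in half-open
    window_seconds: float = 10.0         # rolling window for the failure rate

    def should_trip(self, failures: int, total: int) -> bool:
        """True when the windowed failure rate exceeds the threshold."""
        if total < self.min_requests:
            return False  # not enough data to judge the downstream
        return failures / total >= self.failure_rate_threshold
```

The `min_requests` guard matters: without it, one failure out of one request reads as a 100% failure rate and trips the breaker on noise.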

What Counts as a Failure?

Not all errors should trip the breaker. Tune carefully:

Should trip
Connection timeouts, 5xx errors, request timeouts, network errors. Things indicating the downstream is broken.
Should NOT trip
4xx errors (client error, downstream is fine). 401/403 (auth issue, not downstream health). Validation errors.

A common mistake: tripping the breaker on 4xx. The downstream is healthy; you have a bad request. Don't conflate "I sent something wrong" with "downstream is broken."

Combining with Retries and Timeouts

Circuit breakers, retries, and timeouts compose. Each addresses a different failure mode.

Timeouts
Bound how long you wait. No timeout = thread pool exhaustion under slow downstream. Set per-call timeouts. Always.
Retries
Handle transient failures (network blip). Use exponential backoff + jitter. Limit retry count.
Circuit breaker
Skip retry attempts when the breaker is open. Don't hammer a service known to be down.

The order:

1. Set timeout. (always)
2. Wrap call in retry logic with backoff. (for transient issues)
3. Wrap retry logic in a circuit breaker. (to stop retrying when broken)
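The layering might look like this. A sketch, assuming a breaker object with `allow_request`/`record_*` methods (like the state machine sketched earlier) and a `fn` that already enforces its own per-call timeout:

```python
import random
import time


class BreakerOpenError(Exception):
    """Raised when the circuit breaker rejects the call outright."""


def call_with_resilience(fn, breaker, max_retries=3, base_delay=0.1):
    # 3. Outermost: circuit breaker. If open, skip the retries entirely.
    if not breaker.allow_request():
        raise BreakerOpenError("downstream marked unhealthy; failing fast")
    last_exc = None
    for attempt in range(max_retries + 1):
        try:
            # 1. Innermost: fn is assumed to enforce its own per-call timeout.
            result = fn()
            breaker.record_success()
            return result
        except Exception as exc:
            last_exc = exc
            breaker.record_failure()
            if attempt < max_retries:
                # 2. Retry with exponential backoff plus jitter.
                delay = base_delay * (2 ** attempt) * random.uniform(0.5, 1.5)
                time.sleep(delay)
    raise last_exc
```

Note the ordering in action: when the breaker is open, no retries run at all, which is the whole point of putting it outermost.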

Where to Put Circuit Breakers

External API calls
Stripe, Twilio, SendGrid. Anything outside your control. Always. They will have outages; protect yourself.
Service-to-service calls
In a microservices system, between services. Especially around critical paths.
Database calls
Less common, but useful when DB can fail in ways your app can't recover from quickly. Read replicas particularly.
Message broker writes
When the broker is down, fail fast. Buffer in a backup queue or dead-letter mechanism.
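The broker case can be sketched as fail-fast-with-fallback: when the breaker is open, divert writes to a backlog instead of blocking. A sketch under loose assumptions; `backlog` is an in-memory stand-in for a durable backup queue, and the breaker interface matches the earlier sketches:

```python
from collections import deque


class BufferedPublisher:
    """Fails fast to a local backlog when the broker breaker is open."""

    def __init__(self, publish, breaker):
        self._publish = publish  # real broker publish function
        self._breaker = breaker  # object with allow_request/record_* methods
        self.backlog = deque()   # stand-in for a durable backup queue

    def send(self, message) -> bool:
        if not self._breaker.allow_request():
            self.backlog.append(message)  # fail fast, but keep the message
            return False
        try:
            self._publish(message)
            self._breaker.record_success()
            return True
        except Exception:
            self._breaker.record_failure()
            self.backlog.append(message)
            return False
```

A background task would drain `backlog` once the breaker closes again.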

Implementation Options

Library-based
In-process
Hystrix (deprecated, but the original), Resilience4j (Java), Polly (.NET), pybreaker (Python). Embedded in app code. Per-method granularity.
Service mesh
Infrastructure-level
Istio, Linkerd. Configured declaratively, enforced by sidecars. No app code changes. Less granular but consistent across all services and languages.

Modern systems often combine: service mesh handles default per-service breakers; app code adds per-call breakers for specific external dependencies.

Half-Open Probing

The recovery dance matters:

Naive: send all traffic
Recipe for re-failure
Open → cooldown → close fully. If the downstream is still wobbly, full traffic kills it again. Saw-tooth pattern: 30s up, 30s down.
Probe slowly
Half-open with limited concurrency
Half-open → 1-10 probe requests with limited concurrency. If they succeed, gradually ramp up. Smooth recovery.
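Limiting half-open concurrency can be done with a counting semaphore. A minimal sketch; `max_probes` corresponds to the probe-count knob from the configuration section:

```python
import threading


class ProbeGate:
    """Admits at most max_probes concurrent requests while half-open."""

    def __init__(self, max_probes: int = 3):
        self._slots = threading.BoundedSemaphore(max_probes)

    def try_acquire(self) -> bool:
        # Non-blocking: if all probe slots are taken, reject fast
        # instead of queueing traffic against a possibly-sick downstream.
        return self._slots.acquire(blocking=False)

    def release(self) -> None:
        self._slots.release()
```

On entering half-open, route requests through `try_acquire`; a rejected acquire is treated the same as an open breaker, so only a trickle of probes ever reaches the recovering downstream.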

Combining with Bulkheads

The next lesson covers bulkheads: limiting how much of your resources any one downstream can consume. Circuit breakers + bulkheads together are the foundation of cascading-failure protection.

Recap