Circuit Breakers

7 min read · Updated 2026-04-25

A circuit breaker is the resilience pattern that stops your service from making things worse. When a downstream service is failing, circuit breakers fail fast instead of piling up requests that will time out, protecting both the downstream from overload and your own service from resource exhaustion.

The pattern is named after electrical circuit breakers, and the analogy is apt: a switch that trips when overloaded, and can be reset once the fault clears (unlike a fuse, which blows once).

The Three States

Closed (normal)
Requests flow through. Track success/failure rate.
Open (broken)
Failure threshold exceeded. Reject requests immediately. Don't hit the downstream.
Half-Open (probing)
After cooldown, allow a few test requests through. If they succeed, close. If they fail, back to open.
                β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                β”‚     Closed     β”‚
                β”‚  (normal flow) β”‚
                β””β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                        β”‚ failure threshold exceeded
                        β–Ό
                β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                β”‚      Open      β”‚
   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”‚ (reject fast)  β”‚
   β”‚            β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜
   β”‚                     β”‚ cooldown elapsed
   β”‚                     β–Ό
   β”‚            β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
   β”‚            β”‚   Half-Open    β”‚
   β”‚            β”‚ (probe slowly) β”‚
   β”‚            β””β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”˜
   β”‚       success β”‚         β”‚ failure
   β”‚     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜         └────────┐
   β”‚     β”‚                            β”‚
   β–Ό     β–Ό                            β–Ό
   to "Closed"            (back to "Open")
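The transitions in the diagram can be sketched as a small state machine. This is a minimal, single-threaded sketch; the consecutive-failure threshold and the method names (`allow_request`, `record_success`, `record_failure`) are illustrative choices, not a standard API:

```python
import enum
import time


class State(enum.Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class BreakerStateMachine:
    """Tracks only the state transitions from the diagram above."""

    def __init__(self, failure_threshold=5, cooldown_seconds=30.0):
        self.failure_threshold = failure_threshold  # consecutive failures that trip
        self.cooldown_seconds = cooldown_seconds    # how long to stay open
        self.state = State.CLOSED
        self.failures = 0
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        if self.state is State.OPEN:
            # Cooldown elapsed: move to half-open and let a probe through.
            if time.monotonic() - self.opened_at >= self.cooldown_seconds:
                self.state = State.HALF_OPEN
                return True
            return False  # reject fast
        return True  # closed or half-open: request may proceed

    def record_success(self) -> None:
        # Success in half-open (or closed) resets the breaker to closed.
        self.state = State.CLOSED
        self.failures = 0

    def record_failure(self) -> None:
        if self.state is State.HALF_OPEN:
            self._trip()  # probe failed: straight back to open
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self._trip()

    def _trip(self) -> None:
        self.state = State.OPEN
        self.opened_at = time.monotonic()
```

A production breaker also needs thread safety and a rolling window instead of a bare consecutive-failure counter, but the three states and their transitions are exactly this shape.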

Configuration

A typical circuit breaker has these knobs:

Failure threshold
How many failures (or what percentage) trip the breaker? Common: 50% failure rate over 100 requests, or 5 consecutive failures.
Cooldown / timeout
How long does the breaker stay open before probing? Common: 30-60 seconds.
Probe count
How many requests allowed through in half-open state before deciding? Common: 1-10.
Window size
Time or count window for the failure threshold. Common: rolling 10-second or 100-request window.
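Grouping the knobs into one config object makes the trade-offs explicit. A sketch with illustrative field names and the common defaults from above:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BreakerConfig:
    failure_rate_threshold: float = 0.5  # trip at 50% failures...
    min_requests: int = 100              # ...measured over at least this many calls
    cooldown_seconds: float = 30.0       # how long to stay open before probing
    probe_count: int = 5                 # requests allowed through in half-open
    window_seconds: float = 10.0         # rolling window for the failure rate

    def should_trip(self, failures: int, total: int) -> bool:
        """True when the windowed failure rate exceeds the threshold."""
        if total < self.min_requests:
            return False  # not enough data to judge the downstream
        return failures / total >= self.failure_rate_threshold
```

The `min_requests` guard matters: without it, one failure out of one request reads as a 100% failure rate and trips the breaker on noise.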

What Counts as a Failure?

Not all errors should trip the breaker. Tune carefully:

Should trip
Connection timeouts, 5xx errors, request timeouts, network errors. Things indicating the downstream is broken.
Should NOT trip
4xx errors (client error, downstream is fine). 401/403 (auth issue, not downstream health). Validation errors.

A common mistake: tripping the breaker on 4xx. The downstream is healthy; you have a bad request. Don't conflate "I sent something wrong" with "downstream is broken."

Combining with Retries and Timeouts

Circuit breakers, retries, and timeouts compose. Each addresses a different failure mode.

Timeouts
Bound how long you wait. No timeout = thread pool exhaustion under slow downstream. Set per-call timeouts. Always.
Retries
Handle transient failures (network blip). Use exponential backoff + jitter. Limit retry count.
Circuit breaker
Skip retry attempts when the breaker is open. Don't hammer a service known to be down.

The order:

1. Set timeout. (always)
2. Wrap call in retry logic with backoff. (for transient issues)
3. Wrap retry logic in a circuit breaker. (to stop retrying when broken)
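The layering might look like this. A sketch, assuming a breaker object with `allow_request`/`record_*` methods (like the state machine sketched earlier) and a `fn` that already enforces its own per-call timeout:

```python
import random
import time


class BreakerOpenError(Exception):
    """Raised when the circuit breaker rejects the call outright."""


def call_with_resilience(fn, breaker, max_retries=3, base_delay=0.1):
    # 3. Outermost: circuit breaker. If open, skip the retries entirely.
    if not breaker.allow_request():
        raise BreakerOpenError("downstream marked unhealthy; failing fast")
    last_exc = None
    for attempt in range(max_retries + 1):
        try:
            # 1. Innermost: fn is assumed to enforce its own per-call timeout.
            result = fn()
            breaker.record_success()
            return result
        except Exception as exc:
            last_exc = exc
            breaker.record_failure()
            if attempt < max_retries:
                # 2. Retry with exponential backoff plus jitter.
                delay = base_delay * (2 ** attempt) * random.uniform(0.5, 1.5)
                time.sleep(delay)
    raise last_exc
```

Note the ordering in action: when the breaker is open, no retries run at all, which is the whole point of putting it outermost.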

Where to Put Circuit Breakers

External API calls
Stripe, Twilio, SendGrid. Anything outside your control. Always. They will have outages; protect yourself.
Service-to-service calls
In a microservices system, between services. Especially around critical paths.
Database calls
Less common, but useful when DB can fail in ways your app can't recover from quickly. Read replicas particularly.
Message broker writes
When the broker is down, fail fast. Buffer in a backup queue or dead-letter mechanism.
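The broker case can be sketched as fail-fast-with-fallback: when the breaker is open, divert writes to a backlog instead of blocking. A sketch under loose assumptions; `backlog` is an in-memory stand-in for a durable backup queue, and the breaker interface matches the earlier sketches:

```python
from collections import deque


class BufferedPublisher:
    """Fails fast to a local backlog when the broker breaker is open."""

    def __init__(self, publish, breaker):
        self._publish = publish  # real broker publish function
        self._breaker = breaker  # object with allow_request/record_* methods
        self.backlog = deque()   # stand-in for a durable backup queue

    def send(self, message) -> bool:
        if not self._breaker.allow_request():
            self.backlog.append(message)  # fail fast, but keep the message
            return False
        try:
            self._publish(message)
            self._breaker.record_success()
            return True
        except Exception:
            self._breaker.record_failure()
            self.backlog.append(message)
            return False
```

A background task would drain `backlog` once the breaker closes again.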

Implementation Options

Library-based
In-process
Hystrix (deprecated, but the original), Resilience4j (Java), Polly (.NET), pybreaker (Python). Embedded in app code. Per-method granularity.
Service mesh
Infrastructure-level
Istio, Linkerd. Configured declaratively, enforced by sidecars. No app code changes. Less granular but consistent across all services and languages.

Modern systems often combine: service mesh handles default per-service breakers; app code adds per-call breakers for specific external dependencies.

Half-Open Probing

The recovery dance matters:

Naive: send all traffic
Recipe for re-failure
Open → cooldown → close fully. If the downstream is still wobbly, full traffic kills it again. Saw-tooth pattern: 30s up, 30s down.
Probe slowly
Half-open with limited concurrency
Half-open → 1-10 probe requests with limited concurrency. If they succeed, gradually ramp up. Smooth recovery.
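Limiting half-open concurrency can be done with a counting semaphore. A minimal sketch; `max_probes` corresponds to the probe-count knob from the configuration section:

```python
import threading


class ProbeGate:
    """Admits at most max_probes concurrent requests while half-open."""

    def __init__(self, max_probes: int = 3):
        self._slots = threading.BoundedSemaphore(max_probes)

    def try_acquire(self) -> bool:
        # Non-blocking: if all probe slots are taken, reject fast
        # instead of queueing traffic against a possibly-sick downstream.
        return self._slots.acquire(blocking=False)

    def release(self) -> None:
        self._slots.release()
```

On entering half-open, route requests through `try_acquire`; a rejected acquire is treated the same as an open breaker, so only a trickle of probes ever reaches the recovering downstream.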

Combining with Bulkheads

The next lesson covers bulkheads: limiting how much of your resources any one downstream can consume. Circuit breakers + bulkheads together are the foundation of cascading-failure protection.

Recap