Circuit Breakers
A circuit breaker is the resilience pattern that stops your service from making things worse. When a downstream service is failing, circuit breakers fail fast instead of piling up requests that will time out, protecting both the downstream from overload and your own service from resource exhaustion.
The pattern is named after electrical circuit breakers, and the analogy is close: a switch that trips open when the circuit is overloaded and can be reset once the fault clears.
The Three States
+----------------+
|     Closed     |<---------------------+
| (normal flow)  |                      |
+-------+--------+                      |
        | failure threshold exceeded    |
        v                               |
+----------------+                      |
|      Open      |<-----------+         |
| (reject fast)  |            |         |
+-------+--------+            |         |
        | cooldown elapsed    |         |
        v                     |         |
+----------------+            |         |
|   Half-Open    |            |         |
| (probe slowly) |            |         |
+---+--------+---+            |         |
    |        | failure        |         |
    |        +----------------+         |
    | success                           |
    +-----------------------------------+
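The three states can be sketched as a small state machine. This is a minimal illustration, not any particular library's implementation: the class and parameter names are hypothetical, it counts consecutive failures, treats every exception as a failure, and assumes single-threaded use.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(Exception):
    """Raised when a call is rejected without reaching the downstream."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, cooldown_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock          # injectable so tests can control time
        self.state = State.CLOSED
        self.failures = 0           # consecutive failures seen so far
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.cooldown_seconds:
                raise CircuitOpenError("circuit open, rejecting fast")
            self.state = State.HALF_OPEN    # cooldown elapsed: allow a probe
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        # Any success (including a successful probe) returns to Closed.
        self.failures = 0
        self.state = State.CLOSED

    def _on_failure(self):
        self.failures += 1
        # A failed probe reopens immediately; otherwise wait for the threshold.
        if self.state is State.HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = State.OPEN
            self.opened_at = self.clock()
```

Wrapping a call looks like `breaker.call(fetch_profile, user_id)`, where `fetch_profile` stands in for whatever function talks to the downstream. While the breaker is Open, the caller gets `CircuitOpenError` immediately instead of waiting for a timeout.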
Configuration
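A typical circuit breaker has these knobs:
- Failure threshold: how many failures (or what failure rate) trips the breaker.
- Window size: how many recent calls, or how much recent time, the threshold is measured over.
- Cooldown duration: how long the breaker stays Open before it starts probing.
- Probe count: how many trial requests are allowed through while Half-Open.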
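As a rough sketch, the knobs could be grouped into a config object, with the window and threshold working together to decide when to trip. The field names and defaults below are illustrative assumptions, and treating the failure threshold as a failure rate over a count-based window is one common interpretation rather than the only one.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class BreakerConfig:
    failure_rate_threshold: float = 0.5   # trip when at least 50% of recent calls failed
    window_size: int = 20                 # number of recent calls considered
    cooldown_seconds: float = 30.0        # time spent Open before probing
    probe_count: int = 3                  # trial calls allowed while Half-Open


class FailureWindow:
    """Count-based sliding window over the most recent call outcomes."""

    def __init__(self, config: BreakerConfig):
        self.config = config
        # deque with maxlen drops the oldest outcome automatically.
        self.outcomes = deque(maxlen=config.window_size)  # True means failure

    def record(self, failed: bool) -> None:
        self.outcomes.append(failed)

    def should_trip(self) -> bool:
        # Wait for a full window so a single early failure cannot trip the breaker.
        if len(self.outcomes) < self.config.window_size:
            return False
        rate = sum(self.outcomes) / len(self.outcomes)
        return rate >= self.config.failure_rate_threshold
```

A small window reacts quickly but is noisy; a large window is stable but slow to notice an outage. The right values depend on traffic volume and how costly a false trip is.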
What Counts as a Failure?
Not all errors should trip the breaker. Tune carefully:
- Count connection errors, timeouts, and 5xx responses.
- Do not count 4xx responses.
A common mistake is tripping the breaker on 4xx. The downstream is healthy; you sent a bad request. Don't conflate "I sent something wrong" with "downstream is broken."
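The classification rule can be written as a small predicate. The function name and signature are hypothetical; the sketch assumes the caller has either an HTTP status code or a raised exception, and uses Python's built-in `ConnectionError` and `TimeoutError` as stand-ins for whatever the HTTP client actually raises.

```python
def counts_as_failure(status_code=None, error=None):
    """Return True when an outcome should count against the breaker."""
    if error is not None:
        # Connection problems and timeouts suggest the downstream is unhealthy.
        # Other exceptions (for example a local validation error) do not.
        return isinstance(error, (ConnectionError, TimeoutError))
    if status_code is not None:
        # 5xx: the downstream is failing.
        # 4xx: the request was bad, the downstream is fine.
        return 500 <= status_code <= 599
    return False
```

With this rule, a burst of 404 or 422 responses caused by a bug in the caller leaves the breaker Closed, while a run of 503s or timeouts moves it toward Open.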
Combining with Retries and Timeouts
Circuit breakers, retries, and timeouts compose. Each addresses a different failure mode.
The order, from the innermost layer outward:
1. Set timeout. (always)
2. Wrap call in retry logic with backoff. (for transient issues)
3. Wrap retry logic in a circuit breaker. (to stop retrying when broken)
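The layering above might look like the following sketch. Everything here is illustrative: `SimpleBreaker`, `with_retries`, and `guarded_call` are hypothetical names, the `fetch` argument stands in for a real client call that accepts a timeout, and the breaker is deliberately tiny. Note one design choice: the breaker counts a fully retried operation as a single failure.

```python
import random
import time


class CircuitOpenError(Exception):
    pass


class SimpleBreaker:
    """Deliberately tiny breaker: opens after N consecutive failed operations."""

    def __init__(self, failure_threshold=3, cooldown_seconds=30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.failures = 0
        self.opened_at = None

    def call(self, operation):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown_seconds:
                raise CircuitOpenError("rejecting fast")
            self.opened_at = None       # cooldown elapsed, let a call through
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result


def with_retries(operation, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Retry transient errors with exponential backoff and full jitter."""
    for attempt in range(attempts):
        try:
            return operation()
        except (ConnectionError, TimeoutError):
            if attempt == attempts - 1:
                raise                   # out of attempts, surface the error
            sleep(random.uniform(0, base_delay * 2 ** attempt))


def guarded_call(breaker, fetch, timeout_seconds=2.0, sleep=time.sleep):
    def one_attempt():
        return fetch(timeout_seconds)                   # 1. timeout on every attempt

    def retried():
        return with_retries(one_attempt, sleep=sleep)   # 2. retries around it

    return breaker.call(retried)                        # 3. breaker around the retries
```

Because the breaker sits outside the retry loop, an Open breaker rejects the call before any attempt or backoff sleep happens, which is what stops retries from piling onto a broken downstream.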
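Where to Put Circuit Breakers
Put them around calls that leave your process: external APIs and service-to-service calls, and sometimes database and message broker connections.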
Implementation Options
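There are two main options: a library inside the application (Resilience4j, Polly) or a service mesh (Istio, Linkerd). Modern systems often combine both: the mesh provides default per-service breakers, and application code adds per-call breakers for specific external dependencies.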
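For the application-code side, one common arrangement is a separate breaker per dependency, so that one failing downstream does not block calls to healthy ones. The sketch below shows only the registry part; `BreakerState` is a stand-in for a real breaker, and the dependency names are made up.

```python
import threading
from dataclasses import dataclass


@dataclass
class BreakerState:
    """Stand-in for a real breaker; only the bookkeeping needed for the example."""
    name: str
    is_open: bool = False


class BreakerRegistry:
    """Hands out one breaker per named dependency, created on first use."""

    def __init__(self):
        self._breakers = {}
        self._lock = threading.Lock()   # guards creation across threads

    def for_dependency(self, name: str) -> BreakerState:
        with self._lock:
            if name not in self._breakers:
                self._breakers[name] = BreakerState(name)
            return self._breakers[name]


registry = BreakerRegistry()
registry.for_dependency("payments-api").is_open = True   # payments is failing

# Calls to other dependencies are unaffected by the open payments breaker.
email_breaker = registry.for_dependency("email-api")
```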
Half-Open Probing
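The recovery dance matters: once the cooldown elapses, the breaker moves to Half-Open and lets only a limited number of probe requests through. If the probes succeed, it closes and normal traffic resumes. If a probe fails, it goes back to Open and the cooldown starts again.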
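The probing phase can be sketched as a small gate that admits a fixed number of probes and decides the next state from their results. The class name and the policy are assumptions for illustration: here every probe must succeed before closing, and any failed probe reopens the breaker. Real libraries offer other policies, such as a success-rate threshold.

```python
class HalfOpenGate:
    """Tracks probe calls while a breaker is Half-Open."""

    def __init__(self, probe_count=3):
        self.probe_count = probe_count
        self.admitted = 0       # probes let through so far
        self.succeeded = 0      # probes that came back successfully

    def try_admit(self) -> bool:
        """Admit a probe if the budget allows; other calls are rejected fast."""
        if self.admitted >= self.probe_count:
            return False
        self.admitted += 1
        return True

    def record(self, success: bool) -> str:
        """Return the breaker's next state after a probe finishes."""
        if not success:
            return "open"       # a failed probe reopens and restarts the cooldown
        self.succeeded += 1
        if self.succeeded >= self.probe_count:
            return "closed"     # enough evidence that the downstream recovered
        return "half_open"
```

Limiting the number of probes matters because a downstream that has only just recovered can be knocked over again if full traffic returns all at once.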
Combining with Bulkheads
The next lesson covers bulkheads, which limit how much of your resources any one downstream can consume. Circuit breakers and bulkheads together are the foundation of cascading-failure protection.
Recap
- Circuit breakers fail fast when a downstream is broken, protecting both your service and the downstream.
- Three states: Closed (normal), Open (rejecting), Half-Open (probing for recovery).
- Configure: failure threshold, cooldown duration, probe count, window size.
- Trip on connection / timeout / 5xx errors. Don't trip on 4xx.
- Combine with timeouts (always set them) and retries with backoff + jitter.
- Place around external APIs, service-to-service calls, sometimes DB and broker.
- Implementation: library-based (Resilience4j, Polly) or service mesh (Istio, Linkerd); modern systems often use both.
- The pattern that prevents retry storms, and one of the most important resilience patterns in production SaaS.