Bulkheads
The bulkhead pattern is named after the watertight compartments in ships. When one compartment floods, the bulkheads keep the water from spreading; the ship doesn't sink because of one breach.
In software, bulkheads partition resources so failures in one part can't drain resources from the rest. It's the pattern that prevents one slow downstream from taking down your whole service.
The Problem Bulkheads Solve
Consider a service that calls three downstreams:
         Service A
        /    |    \
       /     |     \
      ▼      ▼      ▼
  Stripe  Sendgrid  PaymentX
(healthy)  (slow)   (healthy)
Without bulkheads, Sendgrid (slow) doesn't just slow down email: it consumes thread pool resources, queues up connections, and eventually starves Stripe and PaymentX calls too. One failing dependency cascades to all of them.
With bulkheads, each dependency has its own resource pool. Sendgrid's slowness can max out the email pool, but Stripe and PaymentX have separate pools and continue working.
Implementation Levels
Thread pool isolation
The classic bulkhead. Each downstream gets its own thread pool.
Stripe calls   → Thread pool A (10 threads, 50 queue)
Sendgrid calls → Thread pool B (5 threads, 20 queue)
PaymentX calls → Thread pool C (10 threads, 50 queue)
When Sendgrid is slow, pool B fills up. New email requests are rejected (or queued briefly). Stripe and PaymentX pools stay healthy.
Implemented in Hystrix, Resilience4j, etc. Memory cost is real (each pool holds threads); use only for the dependencies that warrant it.
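As a minimal sketch of thread-pool bulkheads using Python's standard library (the pool sizes mirror the example above; note that `ThreadPoolExecutor`'s internal work queue is unbounded, so the bounded queues are not modeled here):

```python
import concurrent.futures as cf
import time

# One executor per downstream: a slow dependency can only tie up
# threads in its own pool, never another dependency's.
POOLS = {
    "stripe":   cf.ThreadPoolExecutor(max_workers=10),
    "sendgrid": cf.ThreadPoolExecutor(max_workers=5),
    "paymentx": cf.ThreadPoolExecutor(max_workers=10),
}

def call(dependency, fn, *args):
    """Submit fn to the bulkhead pool for this dependency."""
    return POOLS[dependency].submit(fn, *args)

# Simulate Sendgrid hanging: saturate its 5-thread pool...
hung = [call("sendgrid", time.sleep, 1) for _ in range(5)]
# ...while a Stripe call still completes promptly on its own pool.
ok = call("stripe", lambda: "charged").result(timeout=1)
```

Even with all five Sendgrid threads blocked, the Stripe pool is idle and its call returns immediately.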
Semaphore isolation
Lighter weight: semaphores instead of thread pools.
acquire semaphore (max 10 concurrent calls to Sendgrid)
  → make call
release semaphore
Cheaper than thread pools but doesn't isolate as strongly (slow Sendgrid calls still happen on calling threads). Often the right answer for high-throughput, low-latency dependencies.
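A minimal semaphore-bulkhead sketch (Python stdlib; the `BulkheadFull` exception and the limit of 10 are illustrative):

```python
import threading

# Bulkhead: at most 10 in-flight calls to Sendgrid. Extra callers
# fail fast instead of piling up behind the slow dependency.
sendgrid_slots = threading.BoundedSemaphore(10)

class BulkheadFull(Exception):
    pass

def call_sendgrid(fn, *args):
    # Non-blocking acquire: reject immediately when the bulkhead is full.
    if not sendgrid_slots.acquire(blocking=False):
        raise BulkheadFull("sendgrid bulkhead full, shedding load")
    try:
        return fn(*args)  # runs on the *calling* thread (weaker isolation)
    finally:
        sendgrid_slots.release()
```

The `try/finally` ensures a slot is released even when the call raises, so the bulkhead never leaks capacity.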
Connection pool isolation
Database connections are precious. Allocate separate pools per service or workload.
DB Pool A: HTTP request handlers (50 connections)
DB Pool B: Background jobs (10 connections)
DB Pool C: Reporting queries (5 connections, max 30s timeout)
A long-running reporting query can't consume all connections; it has its own pool. HTTP handlers stay responsive even when reports are slow.
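A toy bounded-pool sketch of the idea (Python stdlib; `make_conn` is a stand-in for a real connection factory, and the sizes mirror the example above):

```python
import queue

class ConnectionPool:
    """Minimal bulkheaded pool: checkout waits up to `timeout`
    seconds, then fails fast instead of starving other workloads."""

    def __init__(self, make_conn, size, timeout):
        self._idle = queue.Queue(maxsize=size)
        self._timeout = timeout
        for _ in range(size):
            self._idle.put(make_conn())

    def checkout(self):
        try:
            return self._idle.get(timeout=self._timeout)
        except queue.Empty:
            raise TimeoutError("pool exhausted; shed load instead of waiting")

    def checkin(self, conn):
        self._idle.put(conn)

# Separate pools per workload, as in the text. A real make_conn
# would open a database connection.
make_conn = lambda: object()
web_pool = ConnectionPool(make_conn, size=50, timeout=1.0)
jobs_pool = ConnectionPool(make_conn, size=10, timeout=5.0)
reporting_pool = ConnectionPool(make_conn, size=5, timeout=30.0)
```

Because each workload checks out only from its own pool, reporting can exhaust its 5 connections without touching the web pool's 50.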
Process / container isolation
For the strongest isolation, run workloads as separate processes or containers.
Pod A: API requests (autoscaling, latency-critical)
Pod B: Async jobs (separate queue, separate scaling)
Pod C: Reporting (resource-limited, low-priority)
If reporting OOMs, only Pod C is affected. Different K8s resource limits per pod type enforce this.
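A sketch of how Pod C's limits might look as a Kubernetes container spec (the names, image, and numbers are illustrative assumptions):

```yaml
# Hypothetical spec for the reporting workload (Pod C): CPU/memory
# limits cap the blast radius of an OOM or CPU spin to this pod.
apiVersion: v1
kind: Pod
metadata:
  name: reporting
spec:
  containers:
    - name: reporting
      image: example.com/reporting:latest   # placeholder image
      resources:
        requests:
          cpu: "250m"
          memory: "512Mi"
        limits:
          cpu: "1"
          memory: "1Gi"
```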
Bulkheads for Multi-Tenancy
In multi-tenant SaaS, per-tenant (or per-tier) bulkheads prevent noisy-neighbor problems at the infrastructure level: one tenant's burst exhausts only that tenant's share of threads, connections, or pods, and other tenants see normal service.
Combining Bulkheads with Other Patterns
Bulkheads compose with the other resilience patterns:
- Timeouts bound how long each call can hold a bulkhead slot; without them, one hung call occupies a slot indefinitely.
- Circuit breakers stop sending traffic to a dependency that keeps failing, letting its bulkhead drain instead of staying saturated.
Sizing Bulkheads
Bulkhead sizes are application-specific. Common heuristics:
- Start from Little's law: expected throughput × p99 latency gives the concurrency the dependency needs; add headroom for bursts.
- Treat initial sizes as estimates: measure under realistic load and tune.
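The throughput × p99 latency heuristic as a worked example (the numbers are illustrative assumptions):

```python
# Little's law sketch: in-flight calls ≈ arrival rate × latency.
rps = 50            # expected calls per second to the dependency
p99_latency = 0.2   # seconds
headroom = 1.5      # allowance for bursts

in_flight = rps * p99_latency            # ~10 concurrent calls at p99
pool_size = round(in_flight * headroom)  # bulkhead of ~15 slots
```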
Examples in Production
When NOT to Use Bulkheads
Don't bulkhead everything; each pool adds memory and tuning overhead. Reserve bulkheads for genuinely independent dependencies whose failures you need to contain.
A Practical Pattern
For a typical SaaS service:
┌─────────────────────────────────────────┐
│ Web request handler (3-tier app)        │
│                                         │
│ Bulkhead: 100 concurrent requests       │
│ Per-request timeout: 5s                 │
│                                         │
│ ▼ DB pool A (50 conn, 1s timeout)       │
│ ▼ Cache pool (no limit, 100ms)          │
│                                         │
│ ▼ External API calls:                   │
│   - Stripe: pool of 10, breaker on 5xx  │
│   - Sendgrid: pool of 5, fire-forget    │
│   - 3rd-party X: pool of 5, breaker     │
└─────────────────────────────────────────┘
Each external dependency is bulkheaded; the worker count is bounded; circuit breakers protect against cascading failure.
Recap
- Bulkheads partition resources so one failure can't drain resources for the rest.
- Implementation levels: thread pools, semaphores, connection pools, processes/containers.
- Multi-tenant SaaS uses bulkheads per-tenant or per-tier to prevent noisy-neighbor cascades.
- Compose with timeouts (bound each call) and circuit breakers (stop retrying).
- Size based on throughput × p99 latency; measure and tune.
- Don't bulkhead everything; overhead matters. Use for genuinely independent dependencies.