Bulkheads

7 min read · Updated 2026-04-25

The bulkhead pattern is named after the watertight compartments in ships. When one compartment floods, the bulkheads keep the water from spreading — the ship doesn’t sink because of one breach.

In software, bulkheads partition resources so failures in one part can’t drain resources from the rest. It’s the pattern that prevents one slow downstream from taking down your whole service.

The Problem Bulkheads Solve

Consider a service that calls three downstreams:

                Service A
                /   |   \
               /    |    \
              ▼     ▼     ▼
         Stripe  Sendgrid  PaymentX
        (healthy) (slow)   (healthy)

Without bulkheads, Sendgrid (slow) doesn’t just slow down email — it consumes thread pool resources, queues up connections, and eventually starves Stripe and PaymentX calls too. One failing dependency cascades to all of them.

With bulkheads, each dependency has its own resource pool. Sendgrid’s slowness can max out the email pool, but Stripe and PaymentX have separate pools and continue working.

Implementation Levels

Thread pool isolation

The classic bulkhead. Each downstream gets its own thread pool.

Stripe calls    → Thread pool A (10 threads, 50 queue)
Sendgrid calls  → Thread pool B (5 threads, 20 queue)
PaymentX calls  → Thread pool C (10 threads, 50 queue)

When Sendgrid is slow, pool B fills up. New email requests are rejected (or queued briefly). Stripe and PaymentX pools stay healthy.

Implemented in Hystrix, Resilience4j, etc. Memory cost is real (each pool holds threads); use only for the dependencies that warrant it.
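A minimal sketch of per-dependency thread pools in Python, using the standard library's `ThreadPoolExecutor`. The pool sizes and dependency names mirror the table above; `call_downstream` and the 5s result timeout are illustrative choices, not a library API.

```python
from concurrent.futures import ThreadPoolExecutor

# One executor per downstream: a slow dependency can only exhaust
# its own pool, never a neighbor's. (Names and sizes are illustrative.
# Note: ThreadPoolExecutor's internal queue is unbounded; a production
# bulkhead would also cap queueing, as Hystrix/Resilience4j do.)
POOLS = {
    "stripe":   ThreadPoolExecutor(max_workers=10),
    "sendgrid": ThreadPoolExecutor(max_workers=5),
    "paymentx": ThreadPoolExecutor(max_workers=10),
}

def call_downstream(name, fn, *args):
    """Run fn on the named dependency's own pool; bound the wait."""
    future = POOLS[name].submit(fn, *args)
    return future.result(timeout=5)  # cap how long the caller blocks
```

A slow `sendgrid` call now ties up at most 5 worker threads; `stripe` and `paymentx` submissions land on untouched pools.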

Semaphore isolation

Lighter weight: semaphores instead of thread pools.

acquire semaphore (max 10 concurrent calls to Sendgrid)
  → make call
release semaphore

Cheaper than thread pools but doesn’t isolate as strongly (slow Sendgrid calls still happen on calling threads). Often the right answer for high-throughput, low-latency dependencies.
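The acquire/call/release shape above can be sketched with Python's `threading.Semaphore`. The context manager, the `BulkheadFullError` name, and the fail-fast (non-blocking acquire) behavior are illustrative choices.

```python
import threading
from contextlib import contextmanager

class BulkheadFullError(Exception):
    """Raised when the bulkhead has no free slot."""

@contextmanager
def bulkhead(sem: threading.Semaphore):
    # Fail fast instead of queueing: if all slots are taken,
    # reject immediately rather than let callers pile up.
    if not sem.acquire(blocking=False):
        raise BulkheadFullError("bulkhead full")
    try:
        yield
    finally:
        sem.release()

sendgrid_sem = threading.Semaphore(10)  # max 10 concurrent Sendgrid calls

def send_email(payload):
    with bulkhead(sendgrid_sem):
        return f"sent:{payload}"  # stand-in for the real HTTP call
```

Note the trade-off the text mentions: the call still runs on the caller's thread, so a stuck call blocks that thread — the semaphore only bounds how many can be stuck at once.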

Connection pool isolation

Database connections are precious. Allocate separate pools per service or workload.

DB Pool A: HTTP request handlers (50 connections)
DB Pool B: Background jobs (10 connections)
DB Pool C: Reporting queries (5 connections, max 30s timeout)

A long-running reporting query can’t consume all connections — it has its own pool. HTTP handlers stay responsive even when reports are slow.
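A toy version of per-workload connection pools, assuming a bounded queue of pre-built connections. `make_conn` is a stand-in for opening a real DB connection; the sizes come from the table above.

```python
import queue

def make_pool(size, make_conn):
    """A tiny connection pool: a bounded queue of pre-built connections."""
    pool = queue.Queue(maxsize=size)
    for _ in range(size):
        pool.put(make_conn())
    return pool

# Separate pools per workload; make_conn is a placeholder object factory.
make_conn = lambda: object()
http_pool      = make_pool(50, make_conn)
jobs_pool      = make_pool(10, make_conn)
reporting_pool = make_pool(5, make_conn)

def with_conn(pool, fn, timeout=1.0):
    # Blocks at most `timeout` seconds; a starved pool raises queue.Empty
    # instead of silently stealing connections from another workload.
    conn = pool.get(timeout=timeout)
    try:
        return fn(conn)
    finally:
        pool.put(conn)  # always return the connection to its own pool
```

Even if all 5 reporting connections are checked out for 30 seconds, `http_pool.get()` never waits on them.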

Process / container isolation

For really strong isolation: separate processes or containers.

Pod A: API requests (autoscaling, latency-critical)
Pod B: Async jobs (separate queue, separate scaling)
Pod C: Reporting (resource-limited, low-priority)

If reporting OOMs, only Pod C is affected. Different K8s resource limits per pod type enforce this.

Bulkheads for Multi-Tenancy

In multi-tenant SaaS, per-tenant bulkheads prevent noisy-neighbor problems at the infrastructure level.

Per-tenant connection pools
Each tenant gets a slice of DB connections. One tenant's slow queries can't starve other tenants.
Per-tenant request quotas
Application-level. Limit concurrent in-flight requests per tenant. Combined with rate limiting.
Tier-based pools
Free tier shares one pool; pro tier shares another; enterprise gets dedicated. Resources allocated by tier.
Dedicated infrastructure
Big enterprise tenants get dedicated app instances and DB connections. Strongest isolation. Highest cost.
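A sketch of per-tenant request quotas with tier-based limits, combining two of the options above. The `TIER_LIMITS` numbers and class shape are illustrative, not a product recommendation.

```python
import threading

# Concurrency budget per pricing tier (illustrative numbers).
TIER_LIMITS = {"free": 2, "pro": 10, "enterprise": 50}

class TenantBulkheads:
    """Lazily creates one semaphore per tenant, sized by its tier."""

    def __init__(self, tier_of):
        self._tier_of = tier_of  # callable: tenant_id -> tier name
        self._sems = {}
        self._lock = threading.Lock()

    def _sem(self, tenant_id):
        with self._lock:
            if tenant_id not in self._sems:
                limit = TIER_LIMITS[self._tier_of(tenant_id)]
                self._sems[tenant_id] = threading.BoundedSemaphore(limit)
            return self._sems[tenant_id]

    def try_acquire(self, tenant_id):
        # Non-blocking: a noisy tenant gets rejected (429, typically)
        # while every other tenant's slots are untouched.
        return self._sem(tenant_id).acquire(blocking=False)

    def release(self, tenant_id):
        self._sems[tenant_id].release()
```

Combined with rate limiting, this caps in-flight work per tenant rather than request rate — the two guard different resources.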

Combining Bulkheads with Other Patterns

Bulkheads compose with the other resilience patterns:

Bulkhead + Circuit Breaker
The classic combo
Bulkhead caps resources. Circuit breaker stops further attempts when the resource is exhausted or downstream is failing. Together: protected, fast-failing.
Bulkhead + Timeout
The boundary
Bulkhead limits concurrent calls. Timeout limits each call duration. Without timeout, bulkhead can fill with stuck calls. Both required.
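The bulkhead-plus-timeout combination can be sketched as follows; the slot-tracking semaphore and `guarded_call` are illustrative, and the caveat in the final comment is real.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, TimeoutError as PoolTimeout

class BulkheadFull(Exception):
    """Raised when all bulkhead slots are busy."""

pool = ThreadPoolExecutor(max_workers=5)   # bulkhead: caps concurrency
slots = threading.BoundedSemaphore(5)      # tracks in-flight calls

def guarded_call(fn, *args, timeout=2.0):
    """Bulkhead + timeout: bounded concurrency AND bounded duration."""
    if not slots.acquire(blocking=False):
        raise BulkheadFull("all slots busy")  # fast rejection
    try:
        # Without this timeout, a stuck call would hold its slot forever
        # and the bulkhead would gradually fill with zombies.
        return pool.submit(fn, *args).result(timeout=timeout)
    finally:
        # Caveat: on timeout the worker thread may still be running; a
        # production implementation must account for that before freeing
        # the slot (e.g. release from the worker, not the caller).
        slots.release()
```

This is why the text says both are required: the semaphore alone bounds *how many* calls are stuck, the timeout alone bounds *how long* each one can be.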

Sizing Bulkheads

Bulkhead sizes are application-specific. Common heuristics:

Measure first
Look at typical concurrency to that downstream. Size pool 2-3x typical, plus headroom.
Latency Γ— throughput
Threads needed = throughput Γ— p99 latency. 100 RPS at 100ms p99 = 10 threads minimum.
Pool too small
Rejected requests under normal load. Customer impact. Increase.
Pool too big
Memory wasted, and isolation weakens: an oversized pool lets so many calls pile up on a slow downstream that the failure still cascades. Shrink.
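The latency-times-throughput heuristic is Little's law. A small helper makes the arithmetic explicit; the `headroom` default reflects the 2-3x guidance above and is an assumption, not a standard.

```python
import math

def pool_size(rps, p99_latency_s, headroom=2.0):
    """Little's law sizing: threads ~= arrival rate x latency x headroom."""
    return math.ceil(rps * p99_latency_s * headroom)

# The worked example from the text: 100 RPS at 100 ms p99.
pool_size(100, 0.100, headroom=1.0)  # -> 10 threads, the bare minimum
pool_size(100, 0.100)                # -> 20 with 2x headroom
```

Re-measure after traffic or downstream latency shifts — a size that was generous at launch becomes the "pool too small" failure mode as load grows.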

Examples in Production

Netflix Hystrix
The original bulkhead implementation at scale. Per-dependency thread pools, dashboard for live monitoring. (Now in maintenance mode; replaced by Resilience4j.)
AWS service architecture
AWS uses cell-based architectures internally — services partitioned into "cells" so failure in one cell doesn't affect others. Bulkhead at the service level.
Twitter's Finagle
Built-in support for connection-pool isolation, timeouts, retries. The model that influenced gRPC and Linkerd.
Service mesh
Istio and Linkerd implement bulkhead-like patterns at the proxy level. Concurrency limits per upstream destination.

When NOT to Use Bulkheads

Simple systems
A single-service app with one DB has no inter-dependency to bulkhead. Default thread/connection pools are enough.
Memory-constrained
Lots of small pools cost memory. If the downstream count is high (50+), reconsider granularity.
Ultra-low-latency
Bulkhead overhead matters when p99 < 1ms. Use semaphores or rely on async I/O instead.

A Practical Pattern

For a typical SaaS service:

┌─────────────────────────────────────────┐
│ Web request handler (3-tier app)        │
│                                         │
│   Bulkhead: 100 concurrent requests     │
│   Per-request timeout: 5s               │
│                                         │
│   ▼ DB pool A (50 conn, 1s timeout)     │
│   ▼ Cache pool (no limit, 100ms)        │
│                                         │
│   ▼ External API calls:                 │
│     - Stripe: pool of 10, breaker on 5xx│
│     - Sendgrid: pool of 5, fire-forget  │
│     - 3rd-party X: pool of 5, breaker   │
└─────────────────────────────────────────┘

Each external dependency is bulkheaded; the worker count is bounded; circuit breakers protect against cascading failure.

Recap