Bulkhead Pattern in Microservices: Keeping Failures Contained

In microservices and distributed systems, resilience matters as much as functionality. Failures are inevitable, but they don’t have to bring the whole system down. One effective way to contain them is the Bulkhead Pattern.

Just like a ship uses bulkheads (watertight compartments) to prevent one leak from sinking the entire vessel, the bulkhead pattern isolates different parts of a system so that failure in one component doesn’t cascade across the whole architecture.


Why Bulkhead Pattern?

Imagine you’re building a payment application with multiple features:

  • Payments Service – processes user transactions.
  • Notifications Service – sends email/SMS alerts.
  • Analytics Service – logs data for reporting.

If your Notifications Service suddenly gets overloaded (say, the SMS gateway slows down), you don’t want that slowness to spill over into Payments. Users should still be able to make payments even if notifications are delayed.

This is where the bulkhead pattern steps in.


How Bulkhead Pattern Works

The bulkhead pattern partitions resources such as threads, database connections, or outbound API call capacity into separate pools for different services or functions. If one pool is exhausted, the others are unaffected.

  • Without Bulkhead:
    All services share the same thread pool. When Notifications slows down, its calls hold threads for longer, the shared pool fills up, and Payments requests queue or time out.
  • With Bulkhead:
    Payments, Notifications, and Analytics each get their own pool. If Notifications fails or stalls, only its own pool is exhausted and Payments continues unaffected.
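
The contrast above can be sketched with a semaphore per compartment. This is a minimal illustration, not a production implementation: the `Bulkhead` class, `BulkheadFullError`, the slot counts, and the `demo` helper are all hypothetical names and values chosen for this example.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a compartment has no free slots (hypothetical name)."""


class Bulkhead:
    """Caps how many calls may run inside one compartment at a time."""

    def __init__(self, name: str, max_concurrent: int) -> None:
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        # Fail fast rather than queue: a full compartment rejects new work.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"{self.name} bulkhead is full")
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()


def demo():
    # Slot counts are assumptions picked to keep the example small.
    notifications = Bulkhead("notifications", max_concurrent=1)
    payments = Bulkhead("payments", max_concurrent=2)
    entered, release = threading.Event(), threading.Event()

    def slow_sms():
        entered.set()            # signal that the only slot is now taken
        release.wait(timeout=5)  # simulate a stalled SMS gateway
        return "sms sent"

    worker = threading.Thread(target=notifications.call, args=(slow_sms,))
    worker.start()
    entered.wait(timeout=5)

    try:
        notifications.call(lambda: "sms sent")
        second_sms = "accepted"
    except BulkheadFullError:
        second_sms = "rejected"

    payment = payments.call(lambda: "payment ok")  # separate compartment
    release.set()
    worker.join()
    return second_sms, payment
```

Calling `demo()` shows the second SMS being rejected while the payment call still goes through, because each compartment counts only its own in-flight calls. Rejecting immediately is one design choice; waiting with a short timeout is another common option.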

Real-World Examples

1. Thread Pool Isolation in Microservices

Suppose you’re using a framework such as Spring Boot or FastAPI. You can route each kind of downstream call through its own thread pool (or, in async code, its own concurrency limit).

  • Payment Service API calls → Pool A (high priority)
  • Notification Service API calls → Pool B (medium priority)
  • Analytics Logging API calls → Pool C (low priority)

Even if Pool B is overloaded, Pool A continues working, ensuring that payments aren’t blocked.
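
A rough sketch of the three pools using Python’s standard `concurrent.futures.ThreadPoolExecutor` might look like the following. The pool sizes, the `send_sms` and `charge` functions, and `run_demo` are assumptions made for illustration; a real service would call its actual clients here.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Pool sizes are illustrative assumptions, not tuning advice.
POOL_SIZES = {"payments": 8, "notifications": 4, "analytics": 2}


def run_demo() -> dict:
    pools = {
        name: ThreadPoolExecutor(max_workers=size, thread_name_prefix=name)
        for name, size in POOL_SIZES.items()
    }
    gateway_recovered = threading.Event()

    def send_sms(user_id: int) -> str:  # hypothetical notification call
        gateway_recovered.wait(timeout=5)  # simulate a stalled SMS gateway
        return f"sms sent to {user_id}"

    def charge(amount: int) -> str:  # hypothetical payment call
        return f"charged {amount}"

    try:
        # Ten slow calls saturate the four notification workers; the rest
        # queue inside the notifications pool only.
        sms_futures = [pools["notifications"].submit(send_sms, i) for i in range(10)]

        # Payments runs on its own workers, so it does not wait behind
        # the SMS backlog.
        payment = pools["payments"].submit(charge, 25).result(timeout=1)
        pending_sms = sum(not f.done() for f in sms_futures)
    finally:
        gateway_recovered.set()  # let the simulated gateway recover
        for pool in pools.values():
            pool.shutdown(wait=True)

    return {"payment": payment, "sms_pending_during_payment": pending_sms}
```

Note that thread pools have no built-in notion of priority. In this sketch, "high priority" is expressed only by giving Payments more workers than the other pools.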


2. Database Connection Pooling

Imagine a single Postgres database serving multiple microservices. If Analytics floods the database with heavy read queries, Payments might time out.

Using bulkhead-style connection pools:

  • Payments → max 30 connections
  • Notifications → max 20 connections
  • Analytics → max 10 connections

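This guarantees that Analytics can never hold more than 10 connections, so Payments can always obtain connections from its own pool regardless of Analytics load. Keep in mind that these limits cap connection counts only; heavy queries still compete for CPU and I/O on the shared database.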
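
To make the idea concrete, here is a small simulation of per-service connection budgets. It does not talk to Postgres: `BoundedConnectionPool`, `PoolExhaustedError`, and `demo` are hypothetical, and the built-in `object` stands in for a real driver’s connect function.

```python
import queue
from contextlib import ExitStack, contextmanager

# Limits taken from the list above.
LIMITS = {"payments": 30, "notifications": 20, "analytics": 10}


class PoolExhaustedError(Exception):
    """Raised when a service has used up its own connection budget."""


class BoundedConnectionPool:
    def __init__(self, service: str, max_connections: int, connect) -> None:
        self.service = service
        self._idle = queue.Queue(maxsize=max_connections)
        for _ in range(max_connections):
            self._idle.put(connect())

    @contextmanager
    def connection(self, timeout: float = 0.05):
        try:
            conn = self._idle.get(timeout=timeout)
        except queue.Empty:
            raise PoolExhaustedError(f"{self.service}: no free connections") from None
        try:
            yield conn
        finally:
            self._idle.put(conn)  # always hand the connection back


def demo():
    # `object` is a placeholder for a real connect function (assumption).
    pools = {
        name: BoundedConnectionPool(name, limit, connect=object)
        for name, limit in LIMITS.items()
    }
    with ExitStack() as held:
        # Analytics checks out its entire budget of 10 connections.
        for _ in range(LIMITS["analytics"]):
            held.enter_context(pools["analytics"].connection())
        try:
            with pools["analytics"].connection():
                analytics = "connected"
        except PoolExhaustedError:
            analytics = "rejected"
        # Payments draws from its own pool and is unaffected.
        with pools["payments"].connection():
            payments = "connected"
    return analytics, payments
```

With Analytics holding all of its connections, an eleventh Analytics request is rejected after the short timeout, while Payments still gets a connection from its own pool.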


3. Circuit Breakers + Bulkheads

Often, the Bulkhead Pattern is combined with the Circuit Breaker Pattern.

  • Bulkhead isolates resources.
  • Circuit Breaker stops repeated calls to a failing service.

Together, they make systems far more resilient.
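
One way the two patterns might be combined is sketched below. This is a deliberately simplified, hand-rolled version: `GuardedCall`, `CallRejectedError`, the thresholds, and the injectable `clock` are assumptions for this example, and the half-open handling is cruder than what a resilience library would provide.

```python
import threading
import time


class CallRejectedError(Exception):
    """Raised when the bulkhead is full or the circuit is open."""


class GuardedCall:
    """Semaphore bulkhead plus a consecutive-failure circuit breaker."""

    def __init__(self, max_concurrent, failure_threshold, reset_after,
                 clock=time.monotonic):
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._failure_threshold = failure_threshold
        self._reset_after = reset_after
        self._clock = clock
        self._lock = threading.Lock()
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def _circuit_open(self) -> bool:
        with self._lock:
            if self._opened_at is None:
                return False
            if self._clock() - self._opened_at >= self._reset_after:
                # Simplified half-open state: allow calls again, but a
                # single further failure re-opens the circuit.
                self._opened_at = None
                self._failures = self._failure_threshold - 1
                return False
            return True

    def _record(self, ok: bool) -> None:
        with self._lock:
            if ok:
                self._failures = 0
                return
            self._failures += 1
            if self._failures >= self._failure_threshold:
                self._opened_at = self._clock()

    def call(self, fn, *args, **kwargs):
        if self._circuit_open():
            raise CallRejectedError("circuit open")
        if not self._slots.acquire(blocking=False):
            raise CallRejectedError("bulkhead full")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._record(ok=False)
            raise
        else:
            self._record(ok=True)
            return result
        finally:
            self._slots.release()


def demo():
    now = [0.0]  # fake clock so the example needs no real waiting
    guard = GuardedCall(max_concurrent=2, failure_threshold=3,
                        reset_after=30.0, clock=lambda: now[0])
    calls_made = 0

    def flaky_sms_gateway():  # hypothetical failing dependency
        nonlocal calls_made
        calls_made += 1
        raise TimeoutError("gateway timed out")

    outcomes = []
    for _ in range(5):
        try:
            guard.call(flaky_sms_gateway)
        except CallRejectedError as exc:
            outcomes.append(str(exc))
        except TimeoutError:
            outcomes.append("timeout")

    now[0] = 31.0  # the reset window has passed
    recovered = guard.call(lambda: "sms sent")
    return outcomes, calls_made, recovered
```

In `demo()`, the first three calls reach the failing gateway and time out, the next two are rejected without touching it, and a call after the reset window succeeds. The bulkhead limits how many threads can be stuck at once; the breaker stops new calls from even trying.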


Benefits of the Bulkhead Pattern

  • Resilience: Failures stay contained.
  • Stability: Critical services remain available even under partial failure.
  • Prioritization: Allocate more resources to high-value operations (like payments).
  • Better UX: Users can still perform core actions even if non-critical features are degraded.


When to Use It

  • In mission-critical systems where uptime matters.
  • When services have different priorities (payments > analytics).
  • In systems that depend on external services (e.g., third-party APIs) that can slow down or fail.

Context and problem

A cloud-based application may include multiple services, with each service having one or more consumers. Excessive load or failure in a service will impact all consumers of the service.

Moreover, a consumer may send requests to multiple services simultaneously, using resources for each request. When the consumer sends a request to a service that is misconfigured or not responding, the resources used by the client’s request may not be freed in a timely manner. As requests to the service continue, those resources may be exhausted. For example, the client’s connection pool may be exhausted. At that point, requests by the consumer to other services are affected. Eventually the consumer can no longer send requests to other services, not just the original unresponsive service.

The same issue of resource exhaustion affects services with multiple consumers. A large number of requests originating from one client may exhaust available resources in the service. Other consumers are no longer able to consume the service, causing a cascading failure effect.

Solution

Partition service instances into different groups, based on consumer load and availability requirements. This design helps to isolate failures, and allows you to sustain service functionality for some consumers, even during a failure.

A consumer can also partition resources, to ensure that resources used to call one service don’t affect the resources used to call another service. For example, a consumer that calls multiple services may be assigned a connection pool for each service. If a service begins to fail, it only affects the connection pool assigned for that service, allowing the consumer to continue using the other services.

The benefits of this pattern include:

  • Isolates consumers and services from cascading failures. An issue affecting a consumer or service can be isolated within its own bulkhead, preventing the entire solution from failing.
  • Allows you to preserve some functionality in the event of a service failure. Other services and features of the application will continue to work.
  • Allows you to deploy services that offer a different quality of service for consuming applications. A high-priority consumer pool can be configured to use high-priority services.

The following diagram shows bulkheads structured around connection pools that call individual services. If Service A fails or causes some other issue, the connection pool is isolated, so only workloads using the connection pool assigned to Service A are affected. Workloads that use Service B and C are not affected and can continue working without interruption.

The next diagram shows multiple clients calling a single service. Each client is assigned a separate service instance. Client 1 has made too many requests and overwhelmed its instance. Because each service instance is isolated from the others, the other clients can continue making calls.
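
The second scenario can be approximated with a toy router that pins each client to its own service instance. Everything here is hypothetical (`ServiceInstance`, `PartitionedRouter`, the capacity of 3), and the counter is single-threaded for brevity, so treat it as a sketch of the routing idea rather than a load balancer.

```python
from dataclasses import dataclass


@dataclass
class ServiceInstance:
    name: str
    capacity: int       # max requests in flight (assumed limit)
    in_flight: int = 0

    def try_accept(self) -> bool:
        # Not thread-safe; a real instance would guard this counter.
        if self.in_flight >= self.capacity:
            return False
        self.in_flight += 1
        return True


class PartitionedRouter:
    """Routes each client only to the instance assigned to it."""

    def __init__(self, assignments: dict) -> None:
        self._assignments = assignments

    def route(self, client_id: str) -> str:
        instance = self._assignments[client_id]
        status = "accepted" if instance.try_accept() else "rejected"
        return f"{client_id} -> {instance.name}: {status}"


def demo():
    router = PartitionedRouter({
        "client-1": ServiceInstance("instance-1", capacity=3),
        "client-2": ServiceInstance("instance-2", capacity=3),
    })
    # Client 1 sends more requests than its instance can hold.
    noisy = [router.route("client-1") for _ in range(5)]
    # Client 2 is served by a different instance and is unaffected.
    quiet = router.route("client-2")
    return noisy, quiet
```

Client 1 overwhelms only `instance-1`: three of its requests are accepted and two rejected, while Client 2’s request to `instance-2` is accepted as usual.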

Issues and considerations

  • Define partitions around the business and technical requirements of the application.
  • If using tactical domain-driven design (DDD) to design microservices, partition boundaries should align with the bounded contexts.
  • When partitioning services or consumers into bulkheads, consider the level of isolation offered by the technology as well as the overhead in terms of cost, performance and manageability.
  • Consider combining bulkheads with retry, circuit breaker, and throttling patterns to provide more sophisticated fault handling.
  • When partitioning consumers into bulkheads, consider using processes, thread pools, and semaphores. Projects like Resilience4j and Polly offer a framework for creating consumer bulkheads.
  • When partitioning services into bulkheads, consider deploying them into separate virtual machines, containers, or processes. Containers offer a good balance of resource isolation with fairly low overhead.
  • Services that communicate using asynchronous messages can be isolated through different sets of queues. Each queue can have a dedicated set of instances processing messages on the queue, or a single group of instances using an algorithm to dequeue and dispatch processing.
  • Determine the level of granularity for the bulkheads. For example, if you want to distribute tenants across partitions, you could place each tenant into a separate partition, or put several tenants into one partition.
  • Monitor each partition’s performance and SLA.
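
The queue-based isolation mentioned in the list above can be sketched with one in-process queue per workload and a dedicated set of workers for each. In a real system these would be separate broker queues and consumer instances; the `start_workers` helper, the `STOP` sentinel, and the worker counts are assumptions for illustration.

```python
import queue
import threading

STOP = object()  # sentinel that tells a worker to exit


def start_workers(name, work_queue, handler, count, results):
    """Start `count` threads that consume only from `work_queue`."""
    def loop():
        while True:
            message = work_queue.get()
            try:
                if message is STOP:
                    return
                results.append(handler(message))
            finally:
                work_queue.task_done()

    threads = [threading.Thread(target=loop, name=f"{name}-{i}") for i in range(count)]
    for thread in threads:
        thread.start()
    return threads


def demo():
    release = threading.Event()
    payments_q, analytics_q = queue.Queue(), queue.Queue()
    settled, logged = [], []

    def log_event(event):
        release.wait(timeout=5)  # simulate a slow analytics sink
        return f"logged {event}"

    workers = start_workers("payments", payments_q, lambda m: f"settled {m}", 2, settled)
    workers += start_workers("analytics", analytics_q, log_event, 1, logged)

    for i in range(20):
        analytics_q.put(f"event-{i}")
    for i in range(3):
        payments_q.put(f"order-{i}")

    payments_q.join()  # payments drains despite the analytics backlog
    logged_while_stalled = len(logged)

    release.set()  # let analytics catch up, then stop all workers
    for work_queue, worker_count in ((payments_q, 2), (analytics_q, 1)):
        for _ in range(worker_count):
            work_queue.put(STOP)
    for worker in workers:
        worker.join()
    return sorted(settled), logged_while_stalled, len(logged)
```

While the analytics worker is stalled on its first message, all three payment messages are processed, because the payment workers never read from the analytics queue. Once the stall clears, the analytics backlog drains on its own workers.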

Closing Thoughts

The Bulkhead Pattern isn’t about preventing failures; it’s about containing them. In microservices, where components depend on one another, bulkheads give you control over how far a failure can spread.

Think of it as designing your system like a ship: even if one compartment floods, the rest keeps sailing.
