The Noisy Neighbor Problem

Sonal_Monis · August 23, 2025, 4:28pm

Why multi-tenant resource-sharing saves money — and sometimes causes pain (and what to do about it)

TL;DR

Multitenant systems share pooled resources to improve utilization and lower costs. But when one tenant consumes a disproportionate share of resources (intentionally or not), other tenants suffer: the classic noisy neighbor problem. On AWS you can mitigate this with good client-side patterns (exponential backoff, retries, and throttling), provider-side controls (quotas, QoS, rebalancing, and monitoring), and, for messaging workloads, Amazon SQS Fair Queues, a new queue-level feature that automatically reduces noisy-neighbor impact in multi-tenant queues.

What is the noisy neighbor problem?

Multitenant systems host multiple customers on shared infrastructure: an obvious win for cost and utilization. The downside: when one tenant's usage spikes, its activity can consume shared capacity and degrade performance for everyone else. That can look like slower response times, increased retries, elevated queue dwell time, or outright failures when capacity is exhausted.

Two common patterns:

1. Single-tenant spike: one tenant’s workload spikes and consumes most capacity at peak. Other tenants’ requests either queue up or fail.

2. Many small tenants, one big aggregate peak: each tenant’s usage is modest, but together they create a throughput peak you didn’t provision for.

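Both are manifestations of the noisy neighbor problem, but they call for different detection and mitigation approaches: the first points at a specific tenant to throttle or isolate, while the second points at overall capacity and tenant placement.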
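
To make the distinction concrete, here is a minimal sketch of how a monitoring job might tell the two patterns apart from one window of per-tenant usage. The function name, the usage dictionary, and the 50% "dominant share" threshold are all illustrative assumptions, not part of any AWS API.

```python
from typing import Dict


def classify_overload(usage: Dict[str, float], capacity: float,
                      dominant_share: float = 0.5) -> str:
    """Classify one window of per-tenant usage against shared capacity.

    usage maps tenant id -> units consumed in the window (hypothetical input).
    dominant_share is an assumed threshold for "one tenant is the cause".
    """
    total = sum(usage.values())
    if total <= capacity:
        return "ok"
    # Over capacity: is a single tenant responsible for most of the load?
    top_tenant = max(usage, key=usage.get)
    if usage[top_tenant] / total >= dominant_share:
        return f"single-tenant spike ({top_tenant})"
    # Nobody dominates, yet the sum exceeds what was provisioned.
    return "aggregate peak"


print(classify_overload({"a": 90, "b": 10, "c": 10}, capacity=100))
print(classify_overload({"a": 30, "b": 30, "c": 30, "d": 30}, capacity=100))
```

The first call reports a spike attributed to tenant "a"; the second reports an aggregate peak, because no single tenant holds half of the load.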

How the problem looks in practice

Imagine a single shared SQS queue where multiple customers send work items. If Customer A suddenly floods the queue, or their messages take much longer to process, consumers spend more of their time on A’s messages while messages from customers B, C, and D sit longer in the queue, increasing dwell time for everyone else.

This is exactly the scenario SQS Fair Queues is designed to mitigate. By tracking message groups, which you can use to represent tenants, SQS dynamically prioritizes message delivery so quiet tenants don’t see their dwell time balloon when a noisy tenant appears.

At a high level, a fair queue:

Detects noisy tenants by observing how messages are distributed among “message groups” while messages are in flight.

Prioritizes returning messages from quieter groups so their dwell time stays low, while noisy groups get proportionally lower priority until the backlog eases.

Applies to standard queues, for messages that include a MessageGroupId. It’s transparent to consumer code and doesn’t require per-consumer logic changes. Throughput isn’t artificially limited; fair queues simply change ordering and prioritization to maintain fairness.
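
The effect of reordering without limiting throughput can be seen in a toy model. The sketch below is not SQS's actual algorithm (AWS does not describe it at this level of detail); it simply compares strict arrival order with a made-up "fewest deliveries so far goes first" rule, assuming one message is delivered per tick. All function names are illustrative.

```python
from collections import defaultdict, deque
from typing import Dict, List


def mean_dwell(order: List[str]) -> Dict[str, float]:
    """Mean delivery tick per tenant, with one message delivered per tick."""
    ticks = defaultdict(list)
    for tick, tenant in enumerate(order, start=1):
        ticks[tenant].append(tick)
    return {tenant: sum(v) / len(v) for tenant, v in ticks.items()}


def fifo_order(backlog: List[str]) -> List[str]:
    """Strict arrival order: what a shared queue without fairness does here."""
    return list(backlog)


def fair_order(backlog: List[str]) -> List[str]:
    """Toy fairness rule: serve the waiting tenant with the fewest deliveries."""
    waiting = defaultdict(deque)
    for position, tenant in enumerate(backlog):
        waiting[tenant].append(position)
    delivered = defaultdict(int)
    order = []
    while waiting:
        # Ties are broken by the arrival position of each tenant's oldest message.
        tenant = min(waiting, key=lambda t: (delivered[t], waiting[t][0]))
        waiting[tenant].popleft()
        delivered[tenant] += 1
        order.append(tenant)
        if not waiting[tenant]:
            del waiting[tenant]
    return order


# Tenant A floods the queue with 8 messages before B and C send one each.
backlog = ["A"] * 8 + ["B", "C"]
fifo = mean_dwell(fifo_order(backlog))
fair = mean_dwell(fair_order(backlog))
print("FIFO:", fifo)
print("fair:", fair)
```

In this model B's single message is delivered at tick 9 under arrival order and at tick 2 under the fairness rule, while the same ten messages are delivered in ten ticks either way. Only A's backlog waits longer.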

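Producers opt in by setting MessageGroupId on each message they send to the standard queue, for example with boto3:

import boto3

sqs = boto3.client("sqs")
queue_url = "YOUR_QUEUE_URL"  # placeholder: URL of your standard queue

sqs.send_message(
    QueueUrl=queue_url,
    MessageBody='{"task":"process-order","orderId":1234}',
    # Set MessageGroupId to identify the tenant or logical group
    MessageGroupId="tenant-123",
)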

Practical mitigations:

Actions clients can take

Implement exponential backoff with jitter and graceful retries for transient failures. Avoid tight retry loops.
Respect documented rate limits and quotas. For predictable high-volume workloads, keep enough instances provisioned and running ahead of the peak to meet the tenant's latency requirements.
Make long-running or heavy operations asynchronous and run them off-peak if possible.
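
As a rough illustration of the first point, here is a minimal retry helper using exponential backoff with "full jitter" (each delay is drawn uniformly between zero and an exponentially growing ceiling). TransientError, the default base and cap, and the function names are assumptions made for this sketch; in real code you would catch whatever throttling or timeout exceptions your client library raises.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying (e.g. throttling)."""


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 10.0,
                  rng: Callable[[], float] = random.random) -> float:
    """Full-jitter delay: uniform between 0 and min(cap, base * 2**attempt)."""
    return rng() * min(cap, base * (2 ** attempt))


def call_with_retries(op: Callable[[], T], max_attempts: int = 5,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Run op, retrying transient failures with jittered exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    for attempt in range(max_attempts):
        try:
            return op()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(backoff_delay(attempt))
    raise AssertionError("unreachable")
```

The jitter matters in a multi-tenant setting: without it, many clients that failed at the same moment retry at the same moment too, which recreates the spike that caused the failure.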

Actions service providers can take

Enforce governance: quotas, rate limits, and throttling policies that prevent a single tenant from taking whole slices of capacity.
Scale & partition: scale up/out, shard tenants across instances or queues, or add deployment stamps to spread load.
Rebalance tenants: when traffic patterns are known, place complementary tenants (those whose peaks don't coincide) together.
Provide paid isolation options: let customers buy pre-provisioned capacity or dedicated resources where needed.
Apply QoS: prioritize critical workloads, and make lower-priority jobs preemptable or run them in lower-cost windows.
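
One common way to enforce per-tenant rate limits is a token bucket per tenant. The sketch below is a single-process, in-memory illustration under assumed names (TokenBucket, TenantThrottle) and with one shared rate and burst for every tenant; a production service would usually keep this state in a shared store and vary limits by plan.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    """Refills at `rate` tokens per second, holding at most `burst` tokens."""

    def __init__(self, rate: float, burst: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.burst = burst
        self.tokens = burst
        self.clock = clock
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Add the tokens earned since the last call, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class TenantThrottle:
    """Keeps one bucket per tenant so a noisy tenant only drains its own."""

    def __init__(self, rate: float, burst: float,
                 clock: Callable[[], float] = time.monotonic):
        self._rate, self._burst, self._clock = rate, burst, clock
        self._buckets: Dict[str, TokenBucket] = {}

    def allow(self, tenant_id: str) -> bool:
        bucket = self._buckets.get(tenant_id)
        if bucket is None:
            bucket = TokenBucket(self._rate, self._burst, self._clock)
            self._buckets[tenant_id] = bucket
        return bucket.allow()
```

A request handler would call throttle.allow(tenant_id) and reject or defer the request (for example with an HTTP 429) when it returns False. Because each tenant has its own bucket, a tenant that exhausts its tokens does not affect the others.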

SQS Fair Queues:

Tag tenant identity on messages: Set MessageGroupId from producers to identify the tenant.
Monitor dwell time & per-group metrics: Build CloudWatch dashboards and alarms for per-group dwell/visibility metrics and ApproximateAgeOfOldestMessage.
Consider partitioning: If one tenant is predictably heavy, put them on a separate queue or a dedicated pipeline.
Load-test: Simulate multi-tenant traffic to verify fair-queue behavior and to tune consumer capacity.
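
For the monitoring item, a queue-wide alarm on ApproximateAgeOfOldestMessage is a reasonable starting point. The helper below only builds the keyword arguments for CloudWatch's put_metric_alarm call, so it runs without AWS access. The alarm name, the five-minute threshold, and the evaluation settings are assumptions to adjust for your workload. Note that this metric covers the whole queue, not a single tenant; check the SQS documentation for the fair-queue-specific metrics available to you.

```python
from typing import Any, Dict


def oldest_message_alarm(queue_name: str,
                         max_age_seconds: int = 300) -> Dict[str, Any]:
    """Build put_metric_alarm kwargs for a queue's oldest-message age."""
    return {
        "AlarmName": f"{queue_name}-oldest-message-age",  # hypothetical naming scheme
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateAgeOfOldestMessage",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": 60,             # evaluate one-minute datapoints
        "EvaluationPeriods": 5,   # assumed: alarm after 5 consecutive breaches
        "Threshold": float(max_age_seconds),
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }


params = oldest_message_alarm("shared-work-queue")
print(params["AlarmName"])

# To create the alarm (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```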

When fair queues are not enough

Fair queues help with queue-level noisy neighbors, but they’re not a silver bullet for all noisy-neighbor problems:

If the shared resource is CPU/memory on a host, we still need container resource limits, OS-level controls, or separate instances.
If database request-units or upstream services are saturated, consider sharding, throttling, or provisioning dedicated resources.
If tenants intentionally try to overwhelm the system (DDoS-style), you need stronger protections (WAF, rate-limiting, AWS Shield, account-level limits).
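
When sharding is the answer, the routing rule can stay very small. The sketch below maps tenants to shards (queues, database partitions, or instances) with a stable hash, plus an override table for tenants that have been moved to dedicated capacity. The function name and the override mechanism are illustrative assumptions.

```python
import hashlib
from typing import Dict, Optional


def shard_for(tenant_id: str, num_shards: int,
              dedicated: Optional[Dict[str, int]] = None) -> int:
    """Return the shard index for a tenant.

    dedicated maps known-heavy tenants to a shard reserved for them
    (a hypothetical override table); everyone else is spread by hash.
    """
    if dedicated and tenant_id in dedicated:
        return dedicated[tenant_id]
    # Use a stable digest: Python's built-in hash() varies between processes.
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Four shared shards (0-3); the heavy tenant gets shard 4 to itself.
overrides = {"tenant-big": 4}
print(shard_for("tenant-123", 4, overrides))
print(shard_for("tenant-big", 4, overrides))
```

Plain modulo hashing reshuffles most tenants whenever the shard count changes. If shards are added often, a consistent-hashing scheme or an explicit tenant-to-shard lookup table may be a better fit.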

Questions to Ponder:

Fair queues are a reminder that platform-level features can solve recurring multi-tenant problems without pushing extra complexity to application code. Is there a platform where a queue-level fairness model could simplify our architecture?
What telemetry would we add to detect noisy neighbors earlier: per-tenant request-units, per-message dwell time, or something else?
Are we comfortable with a “shared but fair” offering, or do we need a paid isolation tier for latency-sensitive tenants?

Resources:

AWS SQS: Amazon SQS fair queues - Amazon Simple Queue Service