Engineering Appendix: Implementing a Multi‑Operator USSD Backend on AWS

rajesh · March 18, 2026, 12:14pm

This write-up is the implementation-focused appendix. It intentionally omits “what is USSD / high-level architecture / normalization overview” topics, as those are already covered in the companion guide.

High-level visual overview: https://medium.com/@rajeshs09858/building-a-high-throughput-ussd-gateway-on-aws-849d6eff1cf4

Scope of This Appendix

For a production-grade USSD backend, the real complexity is not the diagrams but the edge contracts, network constraints, parsing behavior, and operational failure modes. This appendix focuses on implementation details and operational semantics:

Carrier edge contract: HTTP interface, routing, timeouts, retry patterns, and strict request/response semantics
VPN + static IP ingress on AWS: satisfying “static IP only, no DNS” constraints from operator gateways
Request ingestion hardening: XML/JSON/query parsing, encoding normalization, and validation rules
Internal standardization: canonical MSISDN, normalized input, session identity, and phase mapping
Session/state model: TTL tuning, state minimization, consistency on retries, and recovery behavior
Menu engine design: deterministic state machine, pagination strategy, and response-length guarantees
Operational excellence: structured logging, distributed tracing, metrics, alarms, and runbooks

1) Carrier Edge Contract (Service-Level Guarantees to MNOs)

Treat the operator USSD gateway as a strict client with minimal tolerance for ambiguity and a very tight SLA window.

Deterministic endpoint per operator
- Expose stable callback patterns, e.g.:
  - /ussd/callback/
- This allows per-operator adapters (parsers, response serializers) on top of a shared application runtime.
Latency and timeout budget
- Design for a bounded end-to-end response time well below the operator’s USSD session timeout.
- As a working SLO: keep p99 application processing below a few seconds, including parsing, business logic, and response formatting.
Strict response semantics
- Always return the operator’s expected Content-Type, HTTP status code, and response envelope.
- Signal Continue vs End exactly as specified by each operator’s contract (e.g., specific XML tags, JSON fields, or flags).
- Avoid “best effort” guesses—treat any ambiguity as a validation error with an explicit operator-compliant error response.
Idempotent read operations
- Balance checks, list menus, and other “view-only” flows must be idempotent and safe under gateway retries.
- Design all read paths so that duplicated requests (same session_id and phase) yield consistent outcomes without side effects.
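
One way to keep read flows stable under gateway retries is to remember the rendered response per (session_id, phase, input) for a short window and replay it on a duplicate. The following is a minimal sketch, not a definitive implementation: `ReplayCache`, its 30-second window, and the in-process dict are all assumptions for illustration; in a multi-task deployment the cache would have to live in a shared store.

```python
import time
from typing import Callable, Dict, Tuple

Key = Tuple[str, str, str]  # (session_id, phase, user_input)


class ReplayCache:
    """Remembers the response for a callback so a gateway retry gets the same answer."""

    def __init__(self, ttl_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[Key, Tuple[float, str]] = {}

    def get_or_compute(self, key: Key, compute: Callable[[], str]) -> str:
        now = self._clock()
        cached = self._entries.get(key)
        if cached is not None and cached[0] > now:
            return cached[1]      # retry: replay the earlier response
        response = compute()      # first delivery: run the handler once
        self._entries[key] = (now + self._ttl, response)
        return response
```

Expired entries are only overwritten, never evicted, so a real deployment would rely on the backing store's own TTL.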

2) Network Design for IP‑Only Gateways (VPN → Static IP → Service)

Many USSD gateways are configured to call only a static IP address reachable over a dedicated VPN or private network. A common AWS pattern is:

Operator USSD Gateway
   │  (private IP reachability)
   ▼
IPSec Site-to-Site VPN
   ▼
VPC (private subnets)
   ▼
NLB (internal, static private IP per subnet)
   ▼
ECS Fargate tasks (awsvpc mode)

Key implementation details:

Static IP requirement
- Use an internal NLB with explicit subnet mappings and fixed private IPs.
- Publish to each operator a stable IP:port pair that is tied to the NLB, not to individual tasks.
Routing
- Ensure route tables for VPN-connected subnets allow:
  - Traffic from the VPN to the NLB private IPs.
  - Traffic from the NLB to ECS task ENIs (and back).
- Validate routing with synthetic health checks from the operator’s network (or equivalent lab network if available).
Security boundaries
- Restrict inbound traffic to the NLB by VPN CIDRs / operator source IP ranges.
- Restrict ECS task security groups so that only the NLB can reach the callback port (principle of least privilege).
- Avoid public exposure of USSD callbacks unless strictly required.
Health checks
- Expose a dedicated, lightweight health endpoint (e.g. /healthz) that does not depend on downstream systems (DB, external APIs).
- Configure the NLB target group health checks to be stable and cheap, so that downstream outages do not cause the whole service to be marked unhealthy.
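
The static IP requirement above maps onto the `SubnetMappings` parameter of the ELBv2 CreateLoadBalancer API, where each subnet can carry a fixed `PrivateIPv4Address`. Below is a rough sketch that only builds the request arguments; the helper name, load balancer name, subnet IDs, and addresses are placeholders, and the actual call is left as a comment because it needs boto3 and AWS credentials.

```python
from typing import Dict, List


def internal_nlb_request(name: str, subnet_to_ip: Dict[str, str]) -> dict:
    """Build keyword arguments for the ELBv2 CreateLoadBalancer call.

    Each subnet gets one fixed private IPv4 address, which is the address
    published to the operator.
    """
    mappings: List[dict] = [
        {"SubnetId": subnet_id, "PrivateIPv4Address": ip}
        for subnet_id, ip in sorted(subnet_to_ip.items())
    ]
    return {
        "Name": name,
        "Type": "network",
        "Scheme": "internal",
        "SubnetMappings": mappings,
    }


# With boto3 installed and credentials configured, this would be used as:
#   boto3.client("elbv2").create_load_balancer(**internal_nlb_request(...))
```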

3) NLB + Target Group Behavior Relevant to USSD

When your upstream is a telecom USSD gateway, you want predictable and boring L4 behavior.

Listener configuration
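- Use a TCP (or TLS) listener on the callback port (for example, 8080). NLB listeners operate at layer 4 and have no HTTP listener type; HTTP-level health checks are configured on the target group instead, and HTTP-level routing would require an ALB.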
Target group configuration
- Choose a health check protocol (TCP or HTTP) that aligns with your implementation.
- Keep health checks low-latency and resource-light; they should not require database access just to return “OK”.
Connection handling and timeouts
- Expect gateways to reuse connections unpredictably or to aggressively open/close sockets.
- Tune server-side keepalive and idle timeouts to avoid a buildup of half-open or stalled sockets.
- Monitor connection-level metrics (e.g., connection errors, resets, timeouts) and incorporate them into alerts.
Client IP preservation (optional)
- If you must preserve the original operator client IP on the backend, consider Proxy Protocol on the NLB.
- Only enable Proxy Protocol if:
  - Your application server stack can correctly parse it.
  - Your logging/observability pipeline is configured to handle the additional header data.
- Otherwise, it can introduce confusing behavior and broken logs.
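
To make the "can correctly parse it" condition concrete: NLB target groups emit Proxy Protocol version 2, a binary header that arrives before the first HTTP byte. Here is a small sketch of a parser for the TCP-over-IPv4 case only. The function name is made up for this example, and most production servers would use their framework's built-in support instead.

```python
import ipaddress
import struct
from typing import Optional, Tuple

PP2_SIGNATURE = b"\r\n\r\n\x00\r\nQUIT\n"  # fixed 12-byte v2 preamble


def parse_proxy_v2(data: bytes) -> Tuple[Optional[Tuple[str, int]], bytes]:
    """Return ((source_ip, source_port) or None, remaining_bytes).

    Only the PROXY command for TCP over IPv4 is decoded; any other valid
    header is skipped and reported as None.
    """
    if len(data) < 16 or data[:12] != PP2_SIGNATURE:
        raise ValueError("not a Proxy Protocol v2 header")
    ver_cmd, fam_proto, length = struct.unpack("!BBH", data[12:16])
    if ver_cmd >> 4 != 2:
        raise ValueError("unsupported Proxy Protocol version")
    if len(data) < 16 + length:
        raise ValueError("truncated Proxy Protocol header")
    body, rest = data[16:16 + length], data[16 + length:]
    if ver_cmd & 0x0F == 1 and fam_proto == 0x11 and length >= 12:
        src_ip, _dst_ip, src_port, _dst_port = struct.unpack("!4s4sHH", body[:12])
        return (str(ipaddress.IPv4Address(src_ip)), src_port), rest
    return None, rest
```

The remaining bytes are the start of the operator's HTTP request and must be handed to the HTTP parser untouched.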

4) ECS Fargate Runtime Considerations (Stability Over Cleverness)

USSD traffic is high-churn, latency-sensitive, and made up of many small requests. The primary design goal is a stable, horizontally scalable runtime.

awsvpc networking
- Run tasks with their own ENI in private subnets; apply restrictive security groups for inbound/outbound dependencies.
Graceful shutdown
- Implement clean handling of SIGTERM so that in-flight requests can finish before the task exits.
- Align:
  - ECS stop timeout
  - NLB deregistration delay
- This prevents sessions from being abruptly terminated during rolling deployments.
Scaling strategy
- Scale on CPU and memory and, where possible, also on request latency and 5xx error rate.
- Maintain a baseline minimum task count to avoid cold-start effects or sudden capacity cliffs.
Sidecars and observability agents
- If using distributed tracing or log-forwarding daemons, run them as lightweight sidecars.
- Ensure sidecars are resilient (backoff, bounded buffers) to avoid cascading failures when downstream observability systems are degraded.
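
The graceful shutdown bullet can be sketched as a small in-flight tracker: SIGTERM flips a draining flag (so the health endpoint can start failing), and the process waits for active requests before exiting. `Drainer` and its method names are hypothetical; the wait timeout is assumed to be chosen to fit inside the ECS stop timeout.

```python
import signal
import threading


class Drainer:
    """Tracks in-flight requests and flips to draining when SIGTERM arrives."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._in_flight = 0
        self._idle = threading.Event()
        self._idle.set()
        self.draining = False

    def install(self) -> None:
        # Must be called from the main thread; signal.signal enforces that.
        signal.signal(signal.SIGTERM, lambda signum, frame: self.begin_drain())

    def begin_drain(self) -> None:
        self.draining = True  # the health endpoint should report unhealthy from here

    def __enter__(self) -> "Drainer":
        with self._lock:
            self._in_flight += 1
            self._idle.clear()
        return self

    def __exit__(self, *exc) -> None:
        with self._lock:
            self._in_flight -= 1
            if self._in_flight == 0:
                self._idle.set()

    def wait_idle(self, timeout: float) -> bool:
        """Block until no request is in flight, or until timeout seconds pass."""
        return self._idle.wait(timeout)
```

Each request handler would run inside `with drainer:`, and the main thread would call `drainer.wait_idle(...)` after the flag flips, then exit.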

5) Request Ingestion Hardening (Multiple Payload Dialects)

Multi-operator USSD integrations almost always involve heterogeneous payload formats:

XML POST bodies
JSON POST bodies
Query-string-only GETs
Form-encoded payloads (application/x-www-form-urlencoded)

Hardening checklist:

Size limits and parser safety
- Enforce maximum request body sizes at the HTTP layer to avoid parser and memory blowups.
- For XML:
  - Disable external entity resolution.
  - Protect against entity expansion and other XML-based attacks.
Operator-specific quirks
- Some operators send cleanup/abort callbacks on session termination; handle these explicitly.
- Some treat the MSISDN as the session identifier; others provide an explicit session_id.
- Capture these differences in configurable adapters rather than scattering conditional logic across business code.
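
As an illustration of the checklist, here is a sketch of a single ingestion function that enforces a body cap, refuses XML that declares a DTD or entities, and flattens all four dialects into a dict. The 8 KiB cap, the function name, and the flat one-level XML shape are assumptions; the byte-level DTD check also assumes an ASCII-compatible encoding such as UTF-8. A hardened XML library would be a reasonable substitute for the manual check.

```python
import json
import xml.etree.ElementTree as ET
from typing import Dict
from urllib.parse import parse_qs

MAX_BODY_BYTES = 8 * 1024  # assumed cap; USSD payloads are small


def parse_payload(content_type: str, body: bytes, query: str = "") -> Dict[str, str]:
    """Turn an XML, JSON, form-encoded, or query-string request into a flat dict."""
    if len(body) > MAX_BODY_BYTES:
        raise ValueError("request body too large")
    media_type = content_type.split(";")[0].strip().lower()
    if media_type in ("text/xml", "application/xml"):
        # Refuse any DTD so no entity (internal or external) can be declared.
        if b"<!DOCTYPE" in body or b"<!ENTITY" in body:
            raise ValueError("DTD and entity declarations are not accepted")
        root = ET.fromstring(body)
        return {child.tag: (child.text or "").strip() for child in root}
    if media_type == "application/json":
        data = json.loads(body.decode("utf-8"))
        if not isinstance(data, dict):
            raise ValueError("JSON payload must be an object")
        return {str(k): str(v) for k, v in data.items()}
    # Form bodies and query-string-only GETs share the same encoding.
    if media_type == "application/x-www-form-urlencoded":
        raw = body.decode("utf-8")
    else:
        raw = query
    return {k: v[0] for k, v in parse_qs(raw, keep_blank_values=True).items()}
```

Note that `parse_qs` decodes a literal `+` as a space, so an MSISDN sent as `+91...` in a query string must arrive percent-encoded or be repaired by the operator adapter.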

6) Boundary Validation + Standardization (“One Format Inside”)

The most important maintainability decision is: normalize at the edge and keep a single, canonical internal format.

A practical internal request contract typically includes:

MSISDN
- Store and compare in a single canonical format, commonly E.164.
- Reject invalid lengths or formats at the boundary layer.
USSD input
- Canonicalize user input into a stable form (e.g., a path or delimited string) that the menu engine can consistently interpret.
Session identity
- Generate or adopt a stable session_id that remains constant across Begin, Continue, and End phases of the same conversation.
Session phase
- Map operator-specific phase indicators into a small internal enum:
  - Begin
  - Continue
  - End
Auxiliary metadata
- Preserve original operator fields (raw payload, operator identifiers, timestamps) in a structured way for troubleshooting, without letting them leak into core business logic.
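
A compact sketch of what such an internal contract might look like. Everything named here is illustrative: the operator keys, the phase indicator values, the field names (`msisdn`, `session_id`, `input`, `phase`), the 8-digit lower bound, and the treatment of `00` as an international prefix are assumptions, not taken from any real operator spec. Only the 15-digit upper bound comes from E.164 itself.

```python
import re
from dataclasses import dataclass
from enum import Enum


class Phase(Enum):
    BEGIN = "begin"
    CONTINUE = "continue"
    END = "end"


@dataclass(frozen=True)
class UssdRequest:
    operator_key: str
    session_id: str
    msisdn: str       # E.164, e.g. "+919800000000"
    user_input: str
    phase: Phase


# Hypothetical per-operator phase indicators; real values come from each contract.
PHASE_MAP = {
    "operator_a": {"1": Phase.BEGIN, "2": Phase.CONTINUE, "3": Phase.END},
    "operator_b": {"begin": Phase.BEGIN, "continue": Phase.CONTINUE, "end": Phase.END},
}


def to_e164(raw: str, default_country_code: str) -> str:
    """Normalize a subscriber number to E.164 or raise ValueError."""
    digits = re.sub(r"[^0-9+]", "", raw)
    if digits.startswith("+"):
        digits = digits[1:]
    elif digits.startswith("00"):
        digits = digits[2:]                          # assumed international prefix
    elif digits.startswith("0"):
        digits = default_country_code + digits[1:]   # national format with trunk 0
    if not re.fullmatch(r"[1-9][0-9]{7,14}", digits):
        raise ValueError(f"invalid MSISDN: {raw!r}")
    return "+" + digits


def normalize(operator_key: str, fields: dict, default_country_code: str) -> UssdRequest:
    msisdn = to_e164(fields["msisdn"], default_country_code)
    return UssdRequest(
        operator_key=operator_key,
        # Operators without an explicit session id fall back to the MSISDN.
        session_id=fields.get("session_id") or msisdn,
        msisdn=msisdn,
        user_input=fields.get("input", "").strip(),
        phase=PHASE_MAP[operator_key][fields["phase"]],
    )
```

Numbers sent without a country code and without a leading 0 are ambiguous under this scheme and would need an operator-specific rule in the adapter.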

7) Session/State Model (TTL Aligned With Network Reality)

USSD is logically session-based, but your compute layer should be stateless. Session state should live in a fast, external store.

Minimal external session state
- Store only what is strictly required:
  - Session key
  - Current menu path
  - Minimal context (e.g., selected items, partially entered values)
- Use a key–value store with TTL.
Performance characteristics
- Ensure session read/write latency is low and predictable; this is on the critical path for every callback.

Design considerations:

TTL vs real session timeout
- Operator USSD sessions typically time out within a few minutes.
- Configure TTL slightly longer than the network session timeout to tolerate:
  - Delayed callbacks
  - Short-lived network issues
  - Gateway retries
State minimization
- Persist a compact “menu path” and a few scalar fields; avoid storing full domain objects or large blobs.
- This reduces memory footprint and mitigates serialization issues.
Recovery behavior
- If a session record is missing in the middle of a flow (TTL expiry, redeploy, data loss, or long user pause):
  - Return a safe, user-friendly message.
  - Restart from the main/root menu rather than failing hard.

Conceptual lifecycle:

Begin    → create session (ttl = now + N minutes)
Continue → read + update session + refresh ttl
End      → delete (optional) + stop refreshing ttl
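
The lifecycle and the recovery behavior can be sketched with an in-memory stand-in for the external store. `SessionStore`, `on_continue`, and the expiry message are illustrative names and text; a real deployment would back this with a networked key-value store that has native TTL support.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class SessionStore:
    """In-memory TTL store; a stand-in for an external key-value store."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._data: Dict[str, Tuple[float, dict]] = {}

    def begin(self, session_id: str) -> dict:
        state = {"path": ""}
        self._data[session_id] = (self._clock() + self._ttl, state)
        return state

    def load(self, session_id: str) -> Optional[dict]:
        entry = self._data.get(session_id)
        if entry is None or entry[0] <= self._clock():
            self._data.pop(session_id, None)   # expired or never existed
            return None
        return entry[1]

    def save(self, session_id: str, state: dict) -> None:
        # Every write refreshes the TTL, matching the Continue step.
        self._data[session_id] = (self._clock() + self._ttl, state)

    def end(self, session_id: str) -> None:
        self._data.pop(session_id, None)


def on_continue(store: SessionStore, session_id: str, user_input: str) -> Tuple[str, dict]:
    """Advance the menu path, or restart from the root if the session is gone."""
    state = store.load(session_id)
    if state is None:
        state = store.begin(session_id)
        return "Your session expired. Returning to the main menu.", state
    state["path"] = f'{state["path"]}/{user_input}'.lstrip("/")
    store.save(session_id, state)
    return "ok", state
```

The injectable `clock` exists only so expiry can be exercised in tests without sleeping.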

8) Menu Engine Design (Deterministic, Paginated, and Length‑Safe)

USSD menu engines commonly fail in production due to handset constraints and strict message-length limits, not algorithmic complexity.

Design principles:

State machine model
- Treat each menu as a state.
- Transitions are driven by user input (digits, text) and validated against allowed options.
Path-based routing
- Represent the user’s progress as a path (e.g., service_code/step1/step2/...) rather than scattered flags.
- This makes flows easier to reason about, log, and replay.
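
A toy version of the state machine with path-based routing, to show how little machinery it needs. The menu tree and its texts are invented for the example, and the service code prefix is left out of the path for brevity.

```python
from typing import Dict, Tuple

# Hypothetical menu tree: path -> (screen text, allowed inputs at that screen).
MENU: Dict[str, Tuple[str, Tuple[str, ...]]] = {
    "": ("1. Balance\n2. Offers", ("1", "2")),
    "1": ("Your balance is shown here.", ()),
    "2": ("1. Data\n2. Voice", ("1", "2")),
    "2/1": ("Data offers are listed here.", ()),
    "2/2": ("Voice offers are listed here.", ()),
}


def step(path: str, user_input: str) -> Tuple[str, str, bool]:
    """Apply one input to the current path.

    Returns (new_path, message, is_final). Invalid input leaves the path
    unchanged and re-shows the current screen.
    """
    text, allowed = MENU[path]
    if user_input not in allowed:
        return path, "Invalid choice.\n" + text, False
    new_path = f"{path}/{user_input}" if path else user_input
    new_text, next_allowed = MENU[new_path]
    # A screen with no allowed inputs is a leaf, so the session can end there.
    return new_path, new_text, not next_allowed
```

Because the whole conversation is the path string, replaying a logged path through `step` reproduces the screens the user saw.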
Pagination
- For lists that exceed a single screen:
  - Implement pagination with a clear convention (e.g., special “Next/Back” options or reserved keys like N / B).
  - Avoid overloading numeric options in a way that confuses the user.
Response length guarantees
- Enforce a maximum response length per operator (e.g., character cap) and format messages consistently:
  - Controlled line breaks
  - Stable numbering
- Truncate gracefully and prefer fewer, clearer options over dense screens.
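
Pagination and the length cap interact, so it helps to compute pages from the character budget rather than from a fixed item count. Here is one possible sketch: the `N`/`B` keys follow the convention mentioned above, while the function name and the choice to count characters (rather than encoded septets or bytes) are assumptions.

```python
from typing import List, Tuple

NEXT_KEY, BACK_KEY = "N", "B"   # reserved navigation keys (assumed convention)


def render_page(title: str, items: List[str], start: int, max_chars: int) -> Tuple[str, int]:
    """Render items[start:] into one screen of at most max_chars characters.

    Returns (screen_text, index_of_first_item_not_shown). Numbering is global
    (item i is always option i + 1), so it stays stable across pages.
    """
    back = f"\n{BACK_KEY}. Back" if start > 0 else ""
    more = f"\n{NEXT_KEY}. Next"
    lines = [title]
    used = len(title)
    end = start
    while end < len(items):
        line = f"{end + 1}. {items[end]}"
        is_last = end == len(items) - 1
        # Reserve room for the navigation footer that would follow this line.
        reserve = len(back) + (0 if is_last else len(more))
        if used + 1 + len(line) + reserve > max_chars:
            break
        lines.append(line)
        used += 1 + len(line)
        end += 1
    if end == start and start < len(items):
        raise ValueError("max_chars too small to show even one item")
    footer = back + (more if end < len(items) else "")
    return "\n".join(lines) + footer, end
```

The caller would keep the list of previous `start` values in session state so that `B` can return to the exact prior page.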

9) Response Adaptation (Reverse Adapter Layer)

Internally, the application should emit a single canonical response shape, for example:

message: plain text payload
continue_or_end: enum or flag controlling session continuation

Externally, each operator expects a different wrapper format (XML structure, JSON schema, or custom fields).

Key rule:

Perform all operator-specific response formatting at the edge.
- Keep per-operator serialization (XML/JSON, envelope structure, flags) in dedicated adapter modules.
- Do not couple business logic to specific operator formats or schemas.
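
A sketch of the reverse adapter idea: one canonical response type and a serializer per operator, selected by key. The envelope shapes, element and field names, and operator keys below are placeholders, since real envelopes are dictated by each operator's contract.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict, Tuple
from xml.sax.saxutils import escape


@dataclass(frozen=True)
class UssdResponse:
    message: str
    end_session: bool


def to_xml(resp: UssdResponse) -> Tuple[str, str]:
    """Hypothetical XML envelope; element names are placeholders."""
    action = "end" if resp.end_session else "continue"
    body = (
        f"<response><message>{escape(resp.message)}</message>"
        f"<action>{action}</action></response>"
    )
    return "text/xml", body


def to_json(resp: UssdResponse) -> Tuple[str, str]:
    """Hypothetical JSON envelope; field names are placeholders."""
    body = json.dumps({"text": resp.message, "continueSession": not resp.end_session})
    return "application/json", body


SERIALIZERS: Dict[str, Callable[[UssdResponse], Tuple[str, str]]] = {
    "operator_a": to_xml,
    "operator_b": to_json,
}


def serialize(operator_key: str, resp: UssdResponse) -> Tuple[str, str]:
    """Return (content_type, body) in the format the given operator expects."""
    return SERIALIZERS[operator_key](resp)
```

Business code only ever constructs `UssdResponse`; adding an operator means adding one function and one dictionary entry.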

Operational tip:

Log both:
- The internal, standardized response.
- A redacted representation of the operator-specific formatted response.
This enables you to debug “why did the gateway reject this response?” without leaking sensitive information into logs.

10) Operations: Observability, Alerting, and Runbooks

USSD failures usually reach the user as a generic “network error” or “session dropped” message. Finding the root cause requires robust observability.

Minimum viable observability:

Structured logs (machine-parsable) with fields such as:
- operator_key
- session_id
- msisdn (redacted or tokenized)
- session_phase
- latency_ms
- status / HTTP status code
- error_code / application error key
Distributed tracing
- One trace per callback.
- Annotate spans with operator_key and session_id to correlate flows across retries and services.
Metrics (per operator)
- Request count and QPS
- p95/p99 latency
- 4xx / 5xx rates
- Session end rate
- Parse failures
- Validation failures
- Target (ECS task) health status
Alarms / alerts
- Sustained increase in latency beyond SLO
- Spike in parsing or validation errors
- Spike in 5xx responses
- Flapping target health in the NLB
- VPN tunnel instability (down or degraded)
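
For the structured log fields listed above, here is a small sketch that emits one JSON line per callback and tokenizes the MSISDN with a keyed hash, so the same subscriber can be correlated across lines without the number appearing in logs. The function names, the 16-character token length, and the way the secret is passed in are assumptions.

```python
import hashlib
import hmac
import json


def tokenize_msisdn(msisdn: str, secret: bytes) -> str:
    """Keyed hash: stable per subscriber, not reversible without the secret."""
    return hmac.new(secret, msisdn.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def log_line(secret: bytes, *, operator_key: str, session_id: str, msisdn: str,
             session_phase: str, latency_ms: float, status: int,
             error_code: str = "") -> str:
    """Build one machine-parsable log record as a JSON string."""
    record = {
        "operator_key": operator_key,
        "session_id": session_id,
        "msisdn": tokenize_msisdn(msisdn, secret),
        "session_phase": session_phase,
        "latency_ms": round(latency_ms, 1),
        "status": status,
        "error_code": error_code,
    }
    return json.dumps(record, sort_keys=True)
```

For operators that use the MSISDN as the session identifier, `session_id` would need the same tokenization before it is logged.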

Closing Remarks

Building a robust USSD gateway is primarily about clean boundaries, aggressive normalization, and resilient infrastructure, rather than complex algorithms. By:

Treating each MNO as a pluggable adapter,
Normalizing requests and responses into a single internal contract,
Keeping session state in a fast TTL-backed store,
Running on always-warm containers behind a Network Load Balancer,

you can reliably scale from hundreds of thousands to millions of USSD interactions per day while maintaining predictable behavior and operational visibility.

Feedback & Inputs

If you have implemented a similar USSD architecture or have ideas to further optimize performance, session management, or reliability, feel free to share your feedback and suggestions. Insights from real-world deployments, alternative design patterns, or improvements are always welcome and can help make this guide more useful for others building high-throughput telecom integrations.