This write-up is the implementation-focused appendix. It intentionally omits “what is USSD / high-level architecture / normalization overview” topics, as those are already covered in the companion guide.
- High-level visual overview:
https://medium.com/@rajeshs09858/building-a-high-throughput-ussd-gateway-on-aws-849d6eff1cf4
Scope of This Appendix
For a production-grade USSD backend, the real complexity is not the diagrams but the edge contracts, network constraints, parsing behavior, and operational failure modes. This appendix focuses on implementation details and operational semantics:
- Carrier edge contract: HTTP interface, routing, timeouts, retry patterns, and strict request/response semantics
- VPN + static IP ingress on AWS: satisfying “static IP only, no DNS” constraints from operator gateways
- Request ingestion hardening: XML/JSON/query parsing, encoding normalization, and validation rules
- Internal standardization: canonical MSISDN, normalized input, session identity, and phase mapping
- Session/state model: TTL tuning, state minimization, consistency on retries, and recovery behavior
- Menu engine design: deterministic state machine, pagination strategy, and response-length guarantees
- Operational excellence: structured logging, distributed tracing, metrics, alarms, and runbooks
1) Carrier Edge Contract (Service-Level Guarantees to MNOs)
Treat the operator USSD gateway as a strict client with minimal tolerance for ambiguity and a very tight SLA window.
- Deterministic endpoint per operator
  - Expose stable callback patterns, e.g.: /ussd/callback/
  - This allows per-operator adapters (parsers, response serializers) on top of a shared application runtime.
- Latency and timeout budget
  - Design for a bounded end-to-end response time well below the operator’s USSD session timeout.
  - As a working SLO: keep p99 application processing below a few seconds, including parsing, business logic, and response formatting.
- Strict response semantics
  - Always return the operator’s expected Content-Type, HTTP status code, and response envelope.
  - Signal Continue vs End exactly as specified by each operator’s contract (e.g., specific XML tags, JSON fields, or flags).
  - Avoid “best effort” guesses; treat any ambiguity as a validation error with an explicit operator-compliant error response.
- Idempotent read operations
  - Balance checks, list menus, and other “view-only” flows must be idempotent and safe under gateway retries.
  - Design all read paths so that duplicated requests (same session_id and phase) yield consistent outcomes without side effects.
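One way to make the latency budget explicit is to run each handler under a deadline and fall back to a short End message when it overruns. The sketch below is illustrative only: `BUDGET_SECONDS`, `FALLBACK_MESSAGE`, and `handle_with_budget` are hypothetical names, and the 2-second value is an assumption, not an operator requirement.

```python
import concurrent.futures
from typing import Callable

# Hypothetical budget; the real value depends on each operator's session timeout.
BUDGET_SECONDS = 2.0
FALLBACK_MESSAGE = "Service is busy. Please try again."

_pool = concurrent.futures.ThreadPoolExecutor(max_workers=8)


def handle_with_budget(
    handler: Callable[[], tuple[str, bool]],
    budget: float = BUDGET_SECONDS,
) -> tuple[str, bool]:
    """Run handler() and return (message, continue_session).

    If the handler overruns the budget or raises, answer with a short End
    message instead of letting the gateway hit its own timeout.
    """
    future = _pool.submit(handler)
    try:
        return future.result(timeout=budget)
    except concurrent.futures.TimeoutError:
        # The worker thread keeps running after the timeout, so handlers
        # still need their own I/O timeouts on databases and upstream APIs.
        return FALLBACK_MESSAGE, False
    except Exception:
        return FALLBACK_MESSAGE, False
```

The fallback tuple would still pass through the operator-specific serializer, so the gateway receives a well-formed End response rather than a timeout.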
2) Network Design for IP‑Only Gateways (VPN → Static IP → Service)
Many USSD gateways are configured to call only a static IP address reachable over a dedicated VPN or private network. A common AWS pattern is:
Operator USSD Gateway
│ (private IP reachability)
▼
IPSec Site-to-Site VPN
▼
VPC (private subnets)
▼
NLB (internal, static private IP per subnet)
▼
ECS Fargate tasks (awsvpc mode)
Key implementation details:
- Static IP requirement
  - Use an internal NLB with explicit subnet mappings and fixed private IPs.
  - Publish to each operator a stable IP:port pair that is tied to the NLB, not to individual tasks.
- Routing
  - Ensure route tables for VPN-connected subnets allow:
    - Traffic from the VPN to the NLB private IPs.
    - Traffic from the NLB to ECS task ENIs (and back).
  - Validate routing with synthetic health checks from the operator’s network (or equivalent lab network if available).
- Security boundaries
  - Restrict inbound traffic to the NLB by VPN CIDRs / operator source IP ranges.
  - Restrict ECS task security groups so that only the NLB can reach the callback port (principle of least privilege).
  - Avoid public exposure of USSD callbacks unless strictly required.
- Health checks
  - Expose a dedicated, lightweight health endpoint (e.g. /healthz) that does not depend on downstream systems (DB, external APIs).
  - Configure the NLB target group health checks to be stable and cheap, so that downstream outages do not cause the whole service to be marked unhealthy.
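As an illustration of a dependency-free health endpoint, here is a minimal WSGI sketch. The `app` function is a hypothetical stand-in for whatever framework the service actually uses; the point is only that /healthz answers without touching a database or an external API.

```python
def app(environ, start_response):
    """Minimal WSGI app: /healthz answers without touching any dependency."""
    if environ.get("PATH_INFO") == "/healthz":
        body = b"OK"
        status = "200 OK"
    else:
        body = b"not found"
        status = "404 Not Found"
    start_response(status, [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


# To try it locally with the standard library (port 8080 is just an example):
#   from wsgiref.simple_server import make_server
#   make_server("0.0.0.0", 8080, app).serve_forever()
```

Dependency checks (session store, downstream APIs) are better surfaced through metrics and alarms than through this endpoint, since a failing health check removes the target from rotation.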
3) NLB + Target Group Behavior Relevant to USSD
When your upstream is a telecom USSD gateway, you want predictable and boring L4 behavior.
- Listener configuration
  - Use a TCP listener on the callback port (for example, 8080). NLB listeners operate at layer 4 and do not offer an HTTP listener protocol; if you need HTTP-level health checks, configure them on the target group, and if you need HTTP-level routing, handle it in the application or behind an ALB.
- Target group configuration
  - Choose a health check protocol (TCP or HTTP) that aligns with your implementation.
  - Keep health checks low-latency and resource-light; they should not require database access just to return “OK”.
- Connection handling and timeouts
  - Expect gateways to reuse connections unpredictably or to aggressively open/close sockets.
  - Tune server-side keepalive and idle timeouts to avoid a buildup of half-open or stalled sockets.
  - Monitor connection-level metrics (e.g., connection errors, resets, timeouts) and incorporate them into alerts.
- Client IP preservation (optional)
  - If you must preserve the original operator client IP on the backend, consider Proxy Protocol on the NLB.
  - Only enable Proxy Protocol if:
    - Your application server stack can correctly parse it.
    - Your logging/observability pipeline is configured to handle the additional header data.
  - Otherwise, it can introduce confusing behavior and broken logs.
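To make "can correctly parse it" concrete: NLB uses version 2 of Proxy Protocol, which is a binary header prepended to each connection. The sketch below decodes only the TCP-over-IPv4 case and is meant as an illustration of the header layout; most server stacks have built-in support that should be preferred. `parse_proxy_v2` is a hypothetical helper name.

```python
import ipaddress
import struct
from typing import Optional

PP2_SIGNATURE = b"\r\n\r\n\x00\r\nQUIT\n"  # fixed 12-byte preamble


def parse_proxy_v2(data: bytes) -> tuple[Optional[tuple[str, int]], bytes]:
    """Return ((client_ip, client_port) or None, remaining_bytes).

    Only the PROXY command over TCP/IPv4 is decoded. Any other valid header
    (for example the LOCAL command) yields None for the address, but the
    header is still skipped so the HTTP parser sees a clean stream.
    """
    if len(data) < 16 or not data.startswith(PP2_SIGNATURE):
        raise ValueError("missing Proxy Protocol v2 header")
    ver_cmd, fam_proto, length = struct.unpack("!BBH", data[12:16])
    if ver_cmd >> 4 != 2:
        raise ValueError("unsupported Proxy Protocol version")
    if len(data) < 16 + length:
        raise ValueError("truncated Proxy Protocol header")
    payload, rest = data[16:16 + length], data[16 + length:]
    # Low nibble 0x1 = PROXY command; 0x11 = AF_INET + STREAM (TCP over IPv4).
    if ver_cmd & 0x0F == 0x1 and fam_proto == 0x11 and length >= 12:
        src, _dst, src_port, _dst_port = struct.unpack("!4s4sHH", payload[:12])
        return (str(ipaddress.IPv4Address(src)), src_port), rest
    return None, rest
```

If the application does not strip this header before HTTP parsing, every request fails as malformed, which is the "confusing behavior" mentioned above.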
4) ECS Fargate Runtime Considerations (Stability Over Cleverness)
USSD traffic is high-churn, latency-sensitive, and made up of many small requests. The primary design goal is a stable, horizontally scalable runtime.
- awsvpc networking
  - Run tasks with their own ENI in private subnets; apply restrictive security groups for inbound/outbound dependencies.
- Graceful shutdown
  - Implement clean handling of SIGTERM so that in-flight requests can finish before the task exits.
  - Align:
    - ECS stop timeout
    - NLB deregistration delay
  - This prevents sessions from being abruptly terminated during rolling deployments.
- Scaling strategy
  - Scale on CPU and memory plus request latency and 5xx error rate where possible.
  - Maintain a baseline minimum task count to avoid cold-start effects or sudden capacity cliffs.
- Sidecars and observability agents
  - If using distributed tracing or log-forwarding daemons, run them as lightweight sidecars.
  - Ensure sidecars are resilient (backoff, bounded buffers) to avoid cascading failures when downstream observability systems are degraded.
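The graceful-shutdown point can be sketched as a small in-flight tracker: SIGTERM flips a flag, new requests are refused, and the process waits for outstanding requests before exiting. `Drainer` and its method names are hypothetical, and most HTTP servers offer an equivalent built-in mechanism.

```python
import signal
import threading


class Drainer:
    """Tracks in-flight requests and refuses new ones once SIGTERM arrives."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._in_flight = 0
        self._stopping = False

    def install(self) -> None:
        # signal.signal must be called from the main thread.
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame) -> None:
        # Keep the handler trivial: only flip a flag.
        self._stopping = True

    def try_begin(self) -> bool:
        """Call at the start of a request; False means 'shutting down'."""
        with self._cond:
            if self._stopping:
                return False
            self._in_flight += 1
            return True

    def finish(self) -> None:
        """Call when a request completes, whether it succeeded or failed."""
        with self._cond:
            self._in_flight -= 1
            self._cond.notify_all()

    def wait_drained(self, timeout: float) -> bool:
        """Block until nothing is in flight or the timeout elapses."""
        with self._cond:
            return self._cond.wait_for(lambda: self._in_flight == 0, timeout)
```

The drain timeout passed to `wait_drained` should sit below the ECS container `stopTimeout`, and the target group's `deregistration_delay.timeout_seconds` should give the NLB time to stop sending new connections before the task exits.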
5) Request Ingestion Hardening (Multiple Payload Dialects)
Multi-operator USSD integrations almost always involve heterogeneous payload formats:
- XML POST bodies
- JSON POST bodies
- Query-string-only GETs
- Form-encoded payloads (application/x-www-form-urlencoded)
Hardening checklist:
- Size limits and parser safety
  - Enforce maximum request body sizes at the HTTP layer to avoid parser and memory blowups.
  - For XML:
    - Disable external entity resolution.
    - Protect against entity expansion and other XML-based attacks.
- Operator-specific quirks
  - Some operators send cleanup/abort callbacks on session termination; handle these explicitly.
  - Some treat the MSISDN as the session identifier; others provide an explicit session_id.
  - Capture these differences in configurable adapters rather than scattering conditional logic across business code.
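A minimal sketch of a hardened ingestion function covering the four dialects is shown below. It assumes UTF-8 payloads and flat field structures; `MAX_BODY_BYTES`, `BadRequest`, and `parse_payload` are hypothetical names. For XML it takes the conservative route of refusing any document that declares a DTD, which is a stand-in for a dedicated hardened parser such as the third-party defusedxml package.

```python
import json
import xml.etree.ElementTree as ET
from urllib.parse import parse_qs

MAX_BODY_BYTES = 4096  # hypothetical cap; USSD payloads are small


class BadRequest(ValueError):
    """Raised for any payload the boundary layer refuses to interpret."""


def parse_payload(content_type: str, body: bytes, query: str = "") -> dict[str, str]:
    """Decode one callback into a flat dict of string fields."""
    if len(body) > MAX_BODY_BYTES:
        raise BadRequest("body too large")
    try:
        text = body.decode("utf-8")  # assumption: operators send UTF-8
    except UnicodeDecodeError as exc:
        raise BadRequest("body is not valid UTF-8") from exc

    media_type = content_type.split(";")[0].strip().lower()
    if media_type in ("application/xml", "text/xml"):
        # No DOCTYPE means no entity declarations, which rules out both
        # external entities and entity-expansion payloads.
        upper = text.upper()
        if "<!DOCTYPE" in upper or "<!ENTITY" in upper:
            raise BadRequest("DTDs are not accepted")
        try:
            root = ET.fromstring(body)
        except ET.ParseError as exc:
            raise BadRequest("malformed XML") from exc
        return {child.tag: (child.text or "").strip() for child in root}
    if media_type == "application/json":
        try:
            data = json.loads(text)
        except json.JSONDecodeError as exc:
            raise BadRequest("malformed JSON") from exc
        if not isinstance(data, dict):
            raise BadRequest("JSON body must be an object")
        return {str(k): str(v) for k, v in data.items()}
    if media_type == "application/x-www-form-urlencoded":
        return {k: v[0] for k, v in parse_qs(text).items()}
    if not body:
        # Query-string-only GET.
        return {k: v[0] for k, v in parse_qs(query).items()}
    raise BadRequest(f"unsupported content type: {media_type}")
```

The body size check here is a second line of defence; the HTTP server should already reject oversized requests before the body is read into memory.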
6) Boundary Validation + Standardization (“One Format Inside”)
The most important maintainability decision is: normalize at the edge and keep a single, canonical internal format.
A practical internal request contract typically includes:
- MSISDN
  - Store and compare in a single canonical format, commonly E.164.
  - Reject invalid lengths or formats at the boundary layer.
- USSD input
  - Canonicalize user input into a stable form (e.g., a path or delimited string) that the menu engine can consistently interpret.
- Session identity
  - Generate or adopt a stable session_id that remains constant across Begin, Continue, and End phases of the same conversation.
- Session phase
  - Map operator-specific phase indicators into a small internal enum: Begin, Continue, End.
- Auxiliary metadata
  - Preserve original operator fields (raw payload, operator identifiers, timestamps) in a structured way for troubleshooting, without letting them leak into core business logic.
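The internal contract above might look like the following sketch. Everything operator-specific here is hypothetical: the `operator_a` / `operator_b` phase codes, the field names `msisdn`, `phase`, `input`, and the default country code passed by the caller. The lower length bound in the MSISDN check is also an assumption; E.164 itself only fixes the 15-digit maximum.

```python
import re
from dataclasses import dataclass, field
from enum import Enum


class Phase(Enum):
    BEGIN = "begin"
    CONTINUE = "continue"
    END = "end"


# Hypothetical per-operator phase codes; real values come from each contract.
PHASE_CODES = {
    "operator_a": {"1": Phase.BEGIN, "2": Phase.CONTINUE, "3": Phase.END},
    "operator_b": {"new": Phase.BEGIN, "existing": Phase.CONTINUE, "release": Phase.END},
}


@dataclass(frozen=True)
class UssdRequest:
    operator_key: str
    session_id: str
    msisdn: str  # canonical E.164, e.g. +254712345678
    phase: Phase
    user_input: str
    raw: dict = field(default_factory=dict, compare=False)  # troubleshooting only


def normalize_msisdn(raw: str, default_country_code: str) -> str:
    """Convert common operator spellings of a number into E.164."""
    digits = re.sub(r"[\s\-().]", "", raw)
    if digits.startswith("+"):
        digits = digits[1:]
    elif digits.startswith("00"):
        digits = digits[2:]
    elif digits.startswith("0"):
        # National format: swap the trunk prefix for the country code.
        digits = default_country_code + digits[1:]
    # At most 15 digits per E.164; the minimum of 8 is an assumption.
    if not re.fullmatch(r"[1-9]\d{7,14}", digits):
        raise ValueError("invalid MSISDN")
    return "+" + digits


def to_internal(operator_key: str, fields: dict, default_country_code: str) -> UssdRequest:
    msisdn = normalize_msisdn(fields.get("msisdn", ""), default_country_code)
    try:
        phase = PHASE_CODES[operator_key][fields["phase"]]
    except KeyError as exc:
        raise ValueError("unknown operator or phase") from exc
    return UssdRequest(
        operator_key=operator_key,
        # Operators without an explicit session id fall back to the MSISDN.
        session_id=fields.get("session_id") or msisdn,
        msisdn=msisdn,
        phase=phase,
        user_input=fields.get("input", "").strip(),
        raw=dict(fields),
    )
```

Business code then only ever sees `UssdRequest`; the `raw` dict is carried along for logs and never branched on.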
7) Session/State Model (TTL Aligned With Network Reality)
USSD is logically session-based, but your compute layer should be stateless. Session state should live in a fast, external store.
- Minimal external session state
  - Store only what is strictly required:
    - Session key
    - Current menu path
    - Minimal context (e.g., selected items, partially entered values)
  - Use a key–value store with TTL.
- Performance characteristics
  - Ensure session read/write latency is low and predictable; this is on the critical path for every callback.
Design considerations:
- TTL vs real session timeout
  - Operator USSD sessions typically time out within a few minutes.
  - Configure TTL slightly longer than the network session timeout to tolerate:
    - Delayed callbacks
    - Short-lived network issues
    - Gateway retries
- State minimization
  - Persist a compact “menu path” and a few scalar fields; avoid storing full domain objects or large blobs.
  - This reduces memory footprint and mitigates serialization issues.
- Recovery behavior
  - If a session record is missing in the middle of a flow (TTL expiry, redeploy, data loss, or long user pause):
    - Return a safe, user-friendly message.
    - Restart from the main/root menu rather than failing hard.
Conceptual lifecycle:
Begin → create session (ttl = now + N minutes)
Continue → read + update session + refresh ttl
End → delete (optional) + stop refreshing ttl
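The lifecycle above can be sketched with an in-memory stand-in for the external TTL store. `SessionStore` and `on_continue` are hypothetical names, the dictionary replaces whatever key-value service is actually used, and the injected clock exists only so expiry can be exercised without waiting.

```python
import time
from typing import Callable, Optional


class SessionStore:
    """In-memory stand-in for a TTL key-value store (illustration only)."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._data: dict[str, tuple[float, dict]] = {}

    def begin(self, session_id: str) -> dict:
        state = {"path": ""}  # compact state: just the menu path
        self._data[session_id] = (self._clock() + self._ttl, state)
        return state

    def load(self, session_id: str) -> Optional[dict]:
        entry = self._data.get(session_id)
        if entry is None or entry[0] <= self._clock():
            self._data.pop(session_id, None)  # missing or expired
            return None
        return entry[1]

    def save(self, session_id: str, state: dict) -> None:
        # Every write refreshes the TTL.
        self._data[session_id] = (self._clock() + self._ttl, state)

    def end(self, session_id: str) -> None:
        self._data.pop(session_id, None)


def on_continue(store: SessionStore, session_id: str, user_input: str) -> tuple[dict, bool]:
    """Return (state, recovered); recovered=True means the flow restarted at the root."""
    state = store.load(session_id)
    if state is None:
        # Missing record: restart from the root menu instead of failing hard.
        return store.begin(session_id), True
    state["path"] = f"{state['path']}/{user_input}".lstrip("/")
    store.save(session_id, state)
    return state, False
```

When `recovered` is true, the caller would show a short "session expired" notice followed by the root menu.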
8) Menu Engine Design (Deterministic, Paginated, and Length‑Safe)
USSD menu engines commonly fail in production due to handset constraints and strict message-length limits, not algorithmic complexity.
Design principles:
- State machine model
  - Treat each menu as a state.
  - Transitions are driven by user input (digits, text) and validated against allowed options.
- Path-based routing
  - Represent the user’s progress as a path (e.g., service_code/step1/step2/...) rather than scattered flags.
  - This makes flows easier to reason about, log, and replay.
- Pagination
  - For lists that exceed a single screen:
    - Implement pagination with a clear convention (e.g., special “Next/Back” options or reserved keys like N/B).
    - Avoid overloading numeric options in a way that confuses the user.
- Response length guarantees
  - Enforce a maximum response length per operator (e.g., character cap) and format messages consistently:
    - Controlled line breaks
    - Stable numbering
  - Truncate gracefully and prefer fewer, clearer options over dense screens.
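A small sketch of pagination under a length cap follows. It packs numbered options greedily, reserves room for both navigation lines while packing, and keeps numbering continuous across pages. The `N`/`B` keys follow the convention mentioned above; `paginate` and `render` are hypothetical helpers, and the cap is counted in characters, which is an assumption (some operators count encoded bytes).

```python
BACK_LINE = "B. Back"
NEXT_LINE = "N. Next"


def render(title: str, option_lines: list[str], back: bool, nxt: bool) -> str:
    """Join title, options, and navigation lines with single line breaks."""
    lines = [title, *option_lines]
    if back:
        lines.append(BACK_LINE)
    if nxt:
        lines.append(NEXT_LINE)
    return "\n".join(lines)


def paginate(title: str, items: list[str], max_chars: int) -> list[str]:
    """Split items into pages whose rendered text never exceeds max_chars."""
    groups: list[list[str]] = []
    current: list[str] = []
    for number, item in enumerate(items, start=1):
        line = f"{number}. {item}"
        # Reserve space for both navigation lines so that adding them
        # later can never push a page over the cap.
        if current and len(render(title, current + [line], True, True)) > max_chars:
            groups.append(current)
            current = [line]
        else:
            current.append(line)
    if current:
        groups.append(current)

    last = len(groups) - 1
    pages = [render(title, g, back=i > 0, nxt=i < last) for i, g in enumerate(groups)]
    for page in pages:
        if len(page) > max_chars:
            raise ValueError("an option does not fit within the length cap")
    return pages
```

For example, ten items named "Item 1" to "Item 10" under the title "Select" with a 60-character cap yield four pages, the first being "Select", options 1 to 3, and "N. Next". Failing loudly on an oversized single option is a deliberate choice here: silent truncation of a menu label is harder to notice in testing.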
9) Response Adaptation (Reverse Adapter Layer)
Internally, the application should emit a single canonical response shape, for example:
- message: plain text payload
- continue_or_end: enum or flag controlling session continuation
Externally, each operator expects a different wrapper format (XML structure, JSON schema, or custom fields).
Key rule:
- Perform all operator-specific response formatting at the edge.
- Keep per-operator serialization (XML/JSON, envelope structure, flags) in dedicated adapter modules.
- Do not couple business logic to specific operator formats or schemas.
Operational tip:
- Log both:
  - The internal, standardized response.
  - A redacted representation of the operator-specific formatted response.
- This enables you to debug “why did the gateway reject this response?” without leaking sensitive information into logs.
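The reverse adapter layer can be sketched as one canonical response type plus a serializer per operator. Both envelopes below (the `<ussd>` XML document and the `text`/`end` JSON object) are invented for illustration and do not correspond to any real operator contract.

```python
import json
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass(frozen=True)
class UssdResponse:
    message: str
    continue_session: bool


def to_xml(resp: UssdResponse) -> tuple[str, bytes]:
    """Hypothetical XML envelope for 'operator_a'."""
    root = ET.Element("ussd")
    ET.SubElement(root, "message").text = resp.message
    ET.SubElement(root, "action").text = "continue" if resp.continue_session else "end"
    # ElementTree escapes &, <, > in text nodes.
    return "application/xml", ET.tostring(root, encoding="unicode").encode("utf-8")


def to_json(resp: UssdResponse) -> tuple[str, bytes]:
    """Hypothetical JSON schema for 'operator_b'."""
    body = json.dumps({"text": resp.message, "end": not resp.continue_session})
    return "application/json", body.encode("utf-8")


SERIALIZERS = {"operator_a": to_xml, "operator_b": to_json}


def serialize(operator_key: str, resp: UssdResponse) -> tuple[str, bytes]:
    """Return (content_type, body) in the format the operator expects."""
    return SERIALIZERS[operator_key](resp)
```

Adding an operator then means adding one function and one dictionary entry; the menu engine keeps returning `UssdResponse` and never learns about envelopes.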
10) Operations: Observability, Alerting, and Runbooks
USSD failures are often exposed to the user as generic “network error” or “session dropped” messages. To root cause these, you need robust observability.
Minimum viable observability:
- Structured logs (machine-parsable) with fields such as:
  - operator_key
  - session_id
  - msisdn (redacted or tokenized)
  - session_phase
  - latency_ms
  - status / HTTP status code
  - error_code / application error key
- Distributed tracing
  - One trace per callback.
  - Annotate spans with operator_key and session_id to correlate flows across retries and services.
- Metrics (per operator)
  - Request count and QPS
  - p95/p99 latency
  - 4xx / 5xx rates
  - Session end rate
  - Parse failures
  - Validation failures
  - Target (ECS task) health status
- Alarms / alerts
  - Sustained increase in latency beyond SLO
  - Spike in parsing or validation errors
  - Spike in 5xx responses
  - Flapping target health in the NLB
  - VPN tunnel instability (down or degraded)
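As a sketch of the structured log line, the function below emits one JSON object per callback using the field names listed above and replaces the MSISDN with a keyed hash. `TOKEN_KEY`, `tokenize_msisdn`, and `log_line` are hypothetical; the key would normally come from a secrets manager rather than source code.

```python
import hashlib
import hmac
import json
import time
from typing import Optional

# Hypothetical secret; in practice it would be loaded from a secrets manager.
TOKEN_KEY = b"replace-me"


def tokenize_msisdn(msisdn: str) -> str:
    """Keyed hash: the same subscriber correlates across logs without exposing the number."""
    digest = hmac.new(TOKEN_KEY, msisdn.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def log_line(
    operator_key: str,
    session_id: str,
    msisdn: str,
    session_phase: str,
    started_at: float,
    status: int,
    error_code: Optional[str] = None,
) -> str:
    """Build one machine-parsable log record for a completed callback."""
    record = {
        "operator_key": operator_key,
        "session_id": session_id,
        "msisdn": tokenize_msisdn(msisdn),
        "session_phase": session_phase,
        # started_at is expected to be a time.monotonic() reading.
        "latency_ms": round((time.monotonic() - started_at) * 1000, 1),
        "status": status,
        "error_code": error_code,
    }
    return json.dumps(record, sort_keys=True)
```

For operators that use the MSISDN as the session identifier, `session_id` would need the same tokenization, otherwise the number leaks through that field.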
Closing Remarks
Building a robust USSD gateway is primarily about clean boundaries, aggressive normalization, and resilient infrastructure, rather than complex algorithms. By:
- Treating each MNO as a pluggable adapter,
- Normalizing requests and responses into a single internal contract,
- Keeping session state in a fast TTL-backed store,
- Running on always-warm containers behind a Network Load Balancer,
you can reliably scale from hundreds of thousands to millions of USSD interactions per day while maintaining predictable behavior and operational visibility.
Feedback & Inputs
If you have implemented a similar USSD architecture or have ideas to further optimize performance, session management, or reliability, feel free to share your feedback and suggestions. Insights from real-world deployments, alternative design patterns, or improvements are always welcome and can help make this guide more useful for others building high-throughput telecom integrations.