Enabled AWS Canary Checks for Production Endpoints

Problem Statement

In production, we are never fully certain when an application might fail.

Although we had CloudWatch alarms configured, they triggered only after real users had already been affected, for example by:

  • Increased 5XX errors

  • Latency spikes

  • API failures

  • Unexpected downtime

This meant:

  • We were reactive instead of proactive.

  • Failures could go unnoticed during low-traffic periods.

  • If no users were actively using a feature, we had no visibility into its health.

We needed a way to continuously verify that production endpoints were functioning, even when no customer traffic was present.

What Was Tried

To solve this, we implemented AWS CloudWatch Synthetics (Canary monitoring).

CloudWatch Synthetics lets us create canaries: scheduled scripts that simulate real user behavior.

These canaries:

  • Follow the same routes as real users

  • Perform real actions (click, submit, authenticate, etc.)

  • Run at defined intervals

  • Generate metrics and logs

  • Capture screenshots and HAR files
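As a rough illustration of what the simplest kind of canary does on each scheduled run, the sketch below requests a URL once and records the status code and latency. It uses only the Python standard library rather than the Synthetics runtime, and the function name `heartbeat` and the 2-second latency budget are assumptions made for the example, not part of our setup.

```python
import time
import urllib.error
import urllib.request


def heartbeat(url, timeout=5.0, max_latency_seconds=2.0):
    """Request `url` once and report status, latency, and an overall verdict."""
    started = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        status = err.code  # the server answered, but with a 4xx/5xx status
    except OSError:
        status = None  # DNS failure, refused connection, timeout, etc.
    latency = time.monotonic() - started
    healthy = (
        status is not None
        and 200 <= status < 400
        and latency <= max_latency_seconds
    )
    return {
        "url": url,
        "status": status,
        "latency_seconds": latency,
        "healthy": healthy,
    }
```

Inside a real canary, an unhealthy result would typically be turned into a raised exception so that the run is recorded as failed.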

We evaluated multiple blueprints available in CloudWatch Synthetics:

Canary Blueprint Options

  • Heartbeat monitoring

  • API canary

  • Broken link checker

  • Visual monitoring

  • Canary recorder

  • GUI workflow builder

  • Multi-check blueprint

  • Custom scripts

For our use case, we implemented two primary strategies.

What Worked

1. Custom Script for Authenticated API Validation

We created a custom canary script to:

  • Generate authentication tokens

  • Perform secured API calls

  • Validate expected API responses

  • Verify business logic behavior

This allowed us to test:

  • Auth token generation

  • Protected endpoints

  • Response correctness

  • Failure conditions

This let us confirm that backend APIs were healthy even when there was no user traffic.
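The flow described above (get a token, call a protected endpoint, validate the response, confirm that an unauthenticated call is rejected) could look roughly like the sketch below. It is not our actual canary script: the paths `/auth/token` and `/api/bids/health`, the `access_token` and `status` fields, and the `CANARY_*` environment variable names are all hypothetical placeholders. The `handler(event, context)` function follows the entry-point shape used by Python canaries.

```python
import json
import os
import urllib.error
import urllib.request


def http_json(method, url, headers=None, body=None, timeout=10.0):
    """Send a JSON request and return (status_code, parsed_body)."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    request = urllib.request.Request(
        url, data=data, method=method, headers=dict(headers or {})
    )
    if data is not None:
        request.add_header("Content-Type", "application/json")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, json.loads(response.read().decode("utf-8"))
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; report only the status code to the caller.
        return err.code, {}


def check_authenticated_api(base_url, client_id, client_secret, send=http_json):
    """Run the four checks in order; raise RuntimeError on the first failure."""
    # 1. Token generation (hypothetical endpoint and field name).
    status, token_body = send(
        "POST",
        f"{base_url}/auth/token",
        body={"client_id": client_id, "client_secret": client_secret},
    )
    if status != 200 or not token_body.get("access_token"):
        raise RuntimeError(f"token request failed with status {status}")
    auth = {"Authorization": f"Bearer {token_body['access_token']}"}

    # 2. Protected endpoint (hypothetical path).
    status, payload = send("GET", f"{base_url}/api/bids/health", headers=auth)
    if status != 200:
        raise RuntimeError(f"protected call failed with status {status}")

    # 3. Response correctness (hypothetical response shape).
    if payload.get("status") != "ok":
        raise RuntimeError("protected call returned an unexpected body")

    # 4. Failure condition: the same call without a token must be rejected.
    status, _ = send("GET", f"{base_url}/api/bids/health")
    if status not in (401, 403):
        raise RuntimeError(f"unauthenticated call returned {status}")


def handler(event, context):
    # Credentials come from configuration, never from the script itself,
    # and the token is never written to logs or error messages.
    check_authenticated_api(
        os.environ["CANARY_BASE_URL"],
        os.environ["CANARY_CLIENT_ID"],
        os.environ["CANARY_CLIENT_SECRET"],
    )
    return "ok"
```

Passing the transport in as `send` is a design choice for the sketch: it lets the check logic be exercised locally with a fake transport before the script is deployed as a canary.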

2. GUI Workflow Builder for End-to-End Bidding Flow

For the bidding feature, we implemented a GUI-based synthetic test using:

  • GUI Workflow Builder

  • Playwright-based execution

The canary:

  • Opened the production website

  • Logged in

  • Navigated through the bidding flow

  • Placed a test bid (controlled environment)

  • Validated UI behavior

  • Captured screenshots

  • Logged browser activity

This validated:

  • Frontend availability

  • Backend API integration

  • Authentication flow

  • Full end-to-end transaction path

This moved us from “endpoint monitoring” to “real user journey monitoring.”

3. Alarm Integration

Each canary generates CloudWatch metrics.

We configured CloudWatch alarms on these metrics so that when a canary fails:

  • The alarm triggers

  • A notification is sent

  • The issue is detected, ideally before real users are impacted

This made our monitoring proactive instead of reactive.
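One way to express such an alarm is sketched below, as keyword arguments for the CloudWatch `put_metric_alarm` API. Canaries publish to the `CloudWatchSynthetics` namespace with a `CanaryName` dimension, and `SuccessPercent` is one of the metrics emitted there. The alarm name, the "2 of 3 periods" rule, the 5-minute period, and the SNS topic are assumptions for illustration, not our actual settings.

```python
def canary_alarm_params(canary_name, sns_topic_arn, period_seconds=300):
    """Build put_metric_alarm arguments for a canary's SuccessPercent metric."""
    return {
        "AlarmName": f"{canary_name}-failed",  # hypothetical naming scheme
        "Namespace": "CloudWatchSynthetics",
        "MetricName": "SuccessPercent",
        "Dimensions": [{"Name": "CanaryName", "Value": canary_name}],
        "Statistic": "Average",
        # Should match the canary schedule so each period holds one run.
        "Period": period_seconds,
        # Alarm when 2 of the last 3 periods are below 100% success,
        # which tolerates a single transient failure (assumed thresholds).
        "EvaluationPeriods": 3,
        "DatapointsToAlarm": 2,
        "Threshold": 100.0,
        "ComparisonOperator": "LessThanThreshold",
        # Treat a canary that stops reporting as a failure too.
        "TreatMissingData": "breaching",
        "AlarmActions": [sns_topic_arn],
    }


def create_canary_alarm(canary_name, sns_topic_arn):
    """Create or update the alarm; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without the SDK

    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_alarm(**canary_alarm_params(canary_name, sns_topic_arn))
```

Requiring two failing periods out of three trades a few minutes of detection delay for fewer false alarms; a stricter or looser rule may suit other canaries better.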

What Didn’t Work / Challenges

  1. Writing authenticated custom scripts requires careful token handling.

  2. GUI workflow tests can be sensitive to UI changes.

  3. Canary frequency must be balanced with cost.

  4. Synthetic tests should not affect real production data.

  5. Proper error thresholds need tuning to avoid false alarms.

Synthetic monitoring is powerful, but it must be designed carefully.

Cost Consideration

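CloudWatch Synthetics pricing:

  • $0.0015 per canary run (rates vary by Region)

  • The first 100 canary runs each month are free

  • Each run also incurs charges for the underlying services (Lambda, S3, CloudWatch Logs)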

Cost scales with:

  • Frequency

  • Number of canaries

  • Execution duration

We optimized frequency to balance coverage and cost.
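To make the trade-off concrete, the per-run charge can be estimated with a small calculation like the one below. It uses the per-run price and free allowance quoted above, assumes the allowance resets monthly, assumes a 30-day month, and ignores the Lambda, S3, and logging charges that depend on execution duration.

```python
PRICE_PER_RUN_USD = 0.0015   # per-run price quoted in this write-up
FREE_RUNS_PER_MONTH = 100    # free allowance, assumed to reset monthly


def monthly_run_cost(canary_count, interval_minutes, days=30):
    """Return (total_runs, estimated_usd) for canaries on a fixed schedule."""
    runs_per_canary = (days * 24 * 60) // interval_minutes
    total_runs = canary_count * runs_per_canary
    billable_runs = max(0, total_runs - FREE_RUNS_PER_MONTH)
    return total_runs, billable_runs * PRICE_PER_RUN_USD
```

Under these assumptions, one canary running every 5 minutes comes to 8,640 runs and about $12.81 per month, while running the same canary every minute comes to 43,200 runs and about $64.65.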

Final Outcome / Learning

After implementing canary checks:

  • Production endpoints are continuously verified

  • Authentication flow is monitored

  • Critical user journeys are tested automatically

  • Failures are detected before customer impact

  • Confidence in production stability increased

Key learnings:

  • Alarms alone are not enough.

  • Synthetic monitoring fills the visibility gap during low traffic.

  • End-to-end validation is more valuable than simple heartbeat checks.

  • Monitoring should simulate real user behavior, not just check status codes.

We shifted from reactive monitoring to proactive validation.

If there are any alternative approaches to improve synthetic monitoring coverage or reduce canary costs, please feel free to share your suggestions. We’re always open to enhancing our production reliability strategy.