Problem Statement
In production, we can never be fully certain when an application will fail.
Even though we had CloudWatch alarms configured, those alarms only fired after a real user had already been affected, for example by:
- Increased 5XX errors
- Latency spikes
- API failures
- Unexpected downtime
This meant:
- We were reactive instead of proactive.
- Failures could go unnoticed during low-traffic periods.
- If no users were actively using a feature, we had no visibility into its health.
We needed a way to continuously verify that production endpoints were functioning, even when no customer traffic was present.
What Was Tried
To solve this, we implemented AWS CloudWatch Synthetics (Canary monitoring).
CloudWatch Synthetics lets us create canaries: scheduled scripts that simulate real user behavior.
These canaries:
- Follow the same routes as real users
- Perform real actions (click, submit, authenticate, etc.)
- Run at defined intervals
- Generate metrics and logs
- Capture screenshots and HAR files
We evaluated multiple blueprints available in CloudWatch Synthetics:
Canary Blueprint Options
- Heartbeat monitoring
- API canary
- Broken link checker
- Visual monitoring
- Canary recorder
- GUI workflow builder
- Multi-check blueprint
- Custom scripts
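The simplest of these, heartbeat monitoring, boils down to a scheduled request plus pass/fail logic. A minimal sketch of that logic in Python (the URL, threshold, and function names here are illustrative, not our actual canary code):

```python
import time
import urllib.request

# Illustrative threshold -- the real value depends on your SLOs.
MAX_LATENCY_MS = 2000

def classify(status_code: int, latency_ms: float) -> bool:
    """Pass/fail decision a heartbeat canary applies to one probe."""
    if status_code >= 500:           # server-side failure
        return False
    if latency_ms > MAX_LATENCY_MS:  # latency spike
        return False
    return True

def probe(url: str) -> bool:
    """One heartbeat run: fetch the URL, time it, and classify the result."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=10) as resp:
        latency_ms = (time.monotonic() - start) * 1000
        return classify(resp.status, latency_ms)
```

A real canary would additionally emit metrics on each run; the point is that even the simplest blueprint checks latency as well as status codes.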
For our use case, we implemented two primary strategies.
What Worked
Custom Script for Authenticated API Validation
We created a custom canary script to:
- Generate authentication tokens
- Perform secured API calls
- Validate expected API responses
- Verify business logic behavior

This allowed us to test:

- Auth token generation
- Protected endpoints
- Response correctness
- Failure conditions
This ensured backend APIs were healthy even when there was no user traffic.
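The shape of such a script can be sketched as below. Every URL, field name, and credential here is a placeholder; the actual auth flow and response schema depend on your API:

```python
import json
import urllib.request

# Placeholder endpoints -- substitute your real auth and API URLs.
AUTH_URL = "https://example.com/oauth/token"
API_URL = "https://example.com/api/v1/bids"

def get_token(client_id: str, client_secret: str) -> str:
    """Step 1: generate an auth token (client-credentials style, assumed)."""
    body = json.dumps({"client_id": client_id,
                       "client_secret": client_secret}).encode()
    req = urllib.request.Request(
        AUTH_URL, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)["access_token"]

def validate_response(status: int, payload: dict) -> bool:
    """Steps 3-4: check the status AND business-level fields, not just 200 OK."""
    if status != 200:
        return False
    # Hypothetical business check: the response must carry a list of bids.
    return isinstance(payload.get("bids"), list)
```

The key design point is `validate_response`: a canary that only asserts on the status code would miss a backend that returns 200 with a broken payload.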
GUI Workflow Builder for End-to-End Bidding Flow
For the bidding feature, we implemented a GUI-based synthetic test using:
- GUI Workflow Builder
- Playwright-based execution
The canary:
- Opened the production website
- Logged in
- Navigated through the bidding flow
- Placed a test bid (controlled environment)
- Validated UI behavior
- Captured screenshots
- Logged browser activity
This validated:
- Frontend availability
- Backend API integration
- Authentication flow
- Full end-to-end transaction path
This moved us from “endpoint monitoring” to “real user journey monitoring.”
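The steps above can be sketched with Playwright. Note the hedges: every URL and selector below is a placeholder, and CloudWatch Synthetics runs its Playwright canaries on a Node.js runtime, so this Python version only illustrates the shape of the journey:

```python
def confirmation_shown(page_text: str) -> bool:
    """UI validation step: did the page confirm the test bid? (placeholder text)"""
    return "bid placed" in page_text.lower()

def run_bidding_canary() -> None:
    # Imported inside the function so the module loads even where
    # Playwright is not installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("https://example.com")       # open the production site
        page.fill("#username", "canary-user")  # log in (placeholder selectors)
        page.fill("#password", "canary-pass")
        page.click("#login")
        page.click("text=Bidding")             # navigate to the bidding flow
        page.fill("#bid-amount", "1.00")       # place a controlled test bid
        page.click("#submit-bid")
        # Validate UI behavior, then capture evidence for the run artifacts.
        assert confirmation_shown(page.inner_text("body"))
        page.screenshot(path="bid-flow.png")
        browser.close()
```

Because the assertion checks rendered page text rather than an API response, a failure anywhere in the chain (frontend, backend, or auth) surfaces as a failed run.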
Alarm Integration
Each canary generates CloudWatch metrics.
We configured CloudWatch alarms so that if a canary fails:

- The alarm triggers
- A notification is sent
- The issue is detected before real users are impacted
This made our monitoring proactive instead of reactive.
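As a concrete sketch: canaries publish a `SuccessPercent` metric in the `CloudWatchSynthetics` namespace, which an alarm can watch. The canary name, SNS topic ARN, and thresholds below are illustrative, not our exact configuration:

```python
def canary_alarm_params(canary_name: str, sns_topic_arn: str) -> dict:
    """Build the arguments for cloudwatch.put_metric_alarm(**params).

    Alarms when the canary's SuccessPercent drops below 100 for one
    5-minute evaluation period (illustrative thresholds).
    """
    return {
        "AlarmName": f"{canary_name}-failed",
        "Namespace": "CloudWatchSynthetics",
        "MetricName": "SuccessPercent",
        "Dimensions": [{"Name": "CanaryName", "Value": canary_name}],
        "Statistic": "Average",
        "Period": 300,
        "EvaluationPeriods": 1,
        "Threshold": 100,
        "ComparisonOperator": "LessThanThreshold",
        "AlarmActions": [sns_topic_arn],  # notification goes here
    }

# Applying it requires AWS credentials, e.g.:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(
#       **canary_alarm_params("bidding-flow", "arn:aws:sns:..."))
```

Raising `EvaluationPeriods` above 1 trades detection speed for fewer false alarms, which is exactly the threshold tuning discussed below.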
What Didn’t Work / Challenges
- Writing authenticated custom scripts requires careful token handling.
- GUI workflow tests can be sensitive to UI changes.
- Canary frequency must be balanced with cost.
- Synthetic tests should not affect real production data.
- Proper error thresholds need tuning to avoid false alarms.
Synthetic monitoring is powerful, but it must be carefully designed.
Cost Consideration
CloudWatch Synthetics pricing:
- $0.0015 per canary run
- First 100 runs per month are free
Cost scales with:
- Frequency
- Number of canaries
- Execution duration
We optimized frequency to balance coverage and cost.
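A back-of-envelope model makes the frequency/cost trade-off concrete. It uses the per-run price quoted above; check current AWS pricing before relying on the exact figures:

```python
# Figures from the pricing quoted above -- verify against current AWS pricing.
PRICE_PER_RUN = 0.0015
FREE_RUNS = 100

def monthly_cost(canaries: int, interval_minutes: int, days: int = 30) -> float:
    """Estimated monthly Synthetics run cost for evenly scheduled canaries."""
    runs = canaries * (days * 24 * 60) // interval_minutes
    billable = max(0, runs - FREE_RUNS)
    return round(billable * PRICE_PER_RUN, 2)

# One canary every 5 minutes: 8,640 runs/month -> about $12.81.
# The same canary every 15 minutes: 2,880 runs/month -> about $4.17.
```

Tripling the interval cuts the bill by roughly two thirds, which is the kind of trade-off we tuned per canary based on how critical each flow is.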
Final Outcome / Learning
After implementing canary checks:
- Production endpoints are continuously verified
- Authentication flow is monitored
- Critical user journeys are tested automatically
- Failures are detected before customer impact
- Confidence in production stability increased
Key learnings:
- Alarms alone are not enough.
- Synthetic monitoring fills the visibility gap during low traffic.
- End-to-end validation is more valuable than simple heartbeat checks.
- Monitoring should simulate real user behavior, not just check status codes.
We shifted from reactive monitoring to proactive validation.
If there are any alternative approaches to improve synthetic monitoring coverage or reduce canary costs, please feel free to share your suggestions. We’re always open to enhancing our production reliability strategy.