What Happened:
On February 16th (09:04 AM EST), our US customers experienced a 28-minute system downtime due to an application-level bug.
What caused it:
A recent code change in the billing subsystem responsible for updating account plans introduced an edge-case bug that dramatically increased platform load, which caused system downtime.
How we responded:
Our monitoring system promptly identified the issue, triggering an immediate response from our engineering team. By activating failover mechanisms, reverting the problematic code change, and blocking the affected endpoint, we restored normal service by 09:32 AM EST.
What we're doing to prevent recurrence:
To enhance system resilience, we are implementing database optimizations, additional guardrails, and limits to prevent load spikes on the platform. It also includes strengthening our error detection and recovery processes.
We sincerely apologize for any disruption this caused to your workflow.
Thank you for your understanding as we work to continuously improve our platform's reliability.
Your team at monday.com