What Happened:
On July 29th (09:04 AM EST), our US customers experienced a 23-minute incident, including approximately 16 minutes of difficulty loading boards and a subsequent 7-minute system downtime due to an application-level bug.
What caused it:
A recent code change intended to optimize our infrastructure's speed and scale inadvertently introduced a bug affecting data retrieval.
How we responded:
Our monitoring system promptly identified the issue, triggering an immediate response from our engineering team. By activating failover mechanisms and reverting the problematic code change, we restored normal service by 09:27 AM EST.
What we're doing to prevent recurrence:
To enhance system resilience, we are implementing additional layers of automated testing and protection, including strengthening our error detection and recovery processes.
We sincerely apologize for any disruption this caused to your workflow.
Thank you for your understanding as we work to continuously improve our platform's reliability.
Your team at monday.com