What happened:
On January 5th at 09:19 AM EST, customers on our US server experienced about 8 minutes of difficulty loading boards, followed by 23 minutes of system downtime caused by an application-level issue, for a total incident duration of 31 minutes.
There was no data loss or security risk to our customers during this incident, and EMEA and APAC servers were not impacted by the platform access issue.
What caused it:
A recent code change in our left pane feature overloaded our system, which in turn affected data retrieval.
How we responded:
Our monitoring system promptly identified the issue, triggering an immediate response from our engineering team. While the team was working on a fix, the loading problem escalated into full platform downtime. Our team resolved the incident by scaling our data layer servers, and restored normal service by 09:50 AM EST.
What we're doing to prevent recurrence:
We will implement a new performance process in our pre-production environment that enables us to proactively identify and prevent similar issues, thereby minimizing their impact on our customers. To enhance system resilience, we are also adding layers of automated testing and protection, and strengthening our error detection and recovery processes.
We know that any access disruption to your monday.com workflows impacts your productivity, and we're sorry for any frustration or inconvenience caused.
We remain committed to continuously improving the platform's reliability and resiliency, and to protecting your work platform’s uptime goals.
Your engineering team at monday.com