Investigating connectivity issues across accounts and devices

Incident Report for monday.com

Postmortem

Earlier this week, on January 30th, a disruption to our US servers caused platform downtime and connectivity issues for some of our customers. There was no data loss or security risk at any point during the incident, and our EMEA and APAC servers continued to operate normally.

At 11:13 am EST, a rare sequence of events in one of our search flows, together with an atypical volume of platform actions, temporarily overloaded the platform, which resulted in downtime on our US servers. The team immediately worked to identify and resolve the issue and restored platform availability by 11:58 am EST.

At 2:02 pm EST, part of our corrective work to ensure stable long-term service unintentionally put stress on the platform. We therefore decided to limit customer access to the platform, in order to minimize the duration of downtime and to ensure that access, once restored, would be reliable for all of our customers.

We rolled out a fix for the root cause at 2:43 pm EST, and then gradually restored access to the platform for all our customers, while continuing to closely monitor the situation.

We know how crucial the platform is to our customers' daily work. To prioritize restoring availability, we restricted access to our API, automations, dashboards, and search functionality until we had confirmed the platform’s full stability.

This access restriction caused some of our customers to see a ‘location not supported’ message within their account. We regret any confusion this may have caused, and we’ve already implemented changes to how we communicate access and performance issues.

Access was restored for all accounts by 3:21 pm EST, and full stability for all automations, dashboards, and search functions was restored by 4:47 pm EST. We continued to monitor closely, and all platform components have been operating normally since then.

After conducting a comprehensive internal investigation, we’re confident we have identified the root cause and fully blocked the flow that caused the issue, preventing it from recurring. In addition, we are taking a variety of preventive measures, such as adding controls and resources to the platform to ensure its ongoing stability.

We deeply apologize for any inconvenience caused by this incident. We know monday.com is business-essential for our customers, and it is our utmost priority to provide a stable, exceptional experience on our platform. Transparency is one of our core values, so we believe in sharing a summary of our internal investigations with our customers to explain the cause of the disruption to their workflows and the actions we're taking from here.

If you have any additional questions, please reach out to our dedicated support team here.

Posted Feb 01, 2024 - 12:47 UTC

Resolved

The platform is now back to regular service. Please refresh your browser to access the platform. Thank you for your patience.

Posted Jan 30, 2024 - 17:26 UTC

Monitoring

A fix has been implemented, and we are monitoring the results as we restore full system stability.

Posted Jan 30, 2024 - 17:06 UTC

Identified

Our team has identified the root cause of the issue and is working to restore regular service. We will continue to provide updates on our progress.

Posted Jan 30, 2024 - 16:39 UTC

Investigating

We are currently investigating reports of connectivity issues across the platform. Our team is working to resolve this promptly.

Posted Jan 30, 2024 - 16:20 UTC

This incident affected: US (Platform) and EU (Platform).