We experienced two separate incidents on Monday, the 6th of November; both are covered here. The first, which began gradually around 10.01 AM (UTC+1), affected multiple components across the Agent Interface and was caused by the underlying database slowly becoming overloaded.
By 10.15 AM (UTC+1), the database was saturated and the affected components became largely unresponsive, most noticeably with Dashboards failing to load. Engineers immediately convened and concluded that the database had to be scaled up to handle this unprecedented, persistent load. By 10.46 AM (UTC+1) the scaling was complete, and only minutes later, at 10.48 AM (UTC+1), both response times and error rates had stabilized.
The second issue occurred at 12.03 PM (UTC+1), when WebSocket connections were dropped, triggering a high rate of reconnections. The resulting surge in reconnects caused a significant spike in system load and overall slowness lasting roughly 6 minutes, until connections returned to previous levels around 12.09 PM (UTC+1).
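For illustration, a reconnection surge of this kind is typically smoothed on the client side with jittered exponential backoff, so that clients dropped at the same moment do not all reconnect at the same instant. The sketch below is a minimal, hypothetical example; the function and parameter names are assumptions and do not reflect our actual Agent Interface client.

```typescript
// Minimal sketch of client-side reconnection with jittered exponential backoff.
// All names (connectWithRetry, RECONNECT_BASE_MS, etc.) are illustrative only.

const RECONNECT_BASE_MS = 500;   // first retry delay
const RECONNECT_MAX_MS = 30_000; // cap so retries never wait longer than 30s

function reconnectDelay(attempt: number): number {
  // Exponential backoff: 500ms, 1s, 2s, 4s, ... capped at 30s.
  const exponential = Math.min(RECONNECT_BASE_MS * 2 ** attempt, RECONNECT_MAX_MS);
  // Full jitter: pick a random delay in [0, exponential) so that clients
  // dropped at the same moment spread their reconnects over time instead of
  // all hitting the server at once.
  return Math.random() * exponential;
}

function connectWithRetry(url: string, attempt = 0): void {
  const socket = new WebSocket(url);

  socket.onopen = () => {
    attempt = 0; // reset backoff once a connection is established
  };

  socket.onclose = () => {
    const delay = reconnectDelay(attempt);
    setTimeout(() => connectWithRetry(url, attempt + 1), delay);
  };
}
```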
Beyond the immediate actions already taken, which included scaling the database, we will revisit our alerting to ensure we are notified early enough to prevent this kind of prolonged degradation altogether. We are also actively working on migrating the heaviest workloads away from this database to further reduce the load on the system.
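As a sketch of what earlier notification could look like, the example below flags sustained database utilization well before saturation. The metric source, threshold, and window are hypothetical assumptions, not our actual monitoring configuration.

```typescript
// Hypothetical sketch of an early-warning check for database saturation.
// Threshold and window values are illustrative assumptions only.

const UTILIZATION_WARN = 0.7; // warn well below full saturation (1.0)
const SUSTAINED_SAMPLES = 5;  // require several consecutive samples to avoid noise

function shouldAlert(recentUtilization: number[]): boolean {
  // Alert when the last N samples are all above the warning threshold,
  // giving engineers time to scale before the database saturates.
  const window = recentUtilization.slice(-SUSTAINED_SAMPLES);
  return window.length === SUSTAINED_SAMPLES &&
         window.every((u) => u > UTILIZATION_WARN);
}

// Example: a steadily climbing utilization series would trigger the alert.
console.log(shouldAlert([0.55, 0.72, 0.78, 0.81, 0.86, 0.9])); // true
```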
For the WebSocket connection issue, we are looking to make connection closures more predictable, so that their effect on actively working agents is minimized.
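One possible way to make closures more predictable, sketched below purely as an assumption about a general approach, is to spread planned connection closes over a drain window with per-connection jitter and advance notice, instead of dropping many connections at once. The Connection interface and its methods are hypothetical.

```typescript
// Hypothetical sketch: staggering planned WebSocket closures over a drain
// window so that reconnects are spread out rather than arriving all at once.

interface Connection {
  notifyPendingClose(inMs: number): void; // tell the client a close is coming
  close(): void;
}

const DRAIN_WINDOW_MS = 60_000; // spread closures over one minute

function drainConnections(connections: Connection[]): void {
  for (const conn of connections) {
    // Each connection gets its own random offset inside the drain window,
    // so the resulting reconnect load is spread out instead of spiking.
    const offset = Math.random() * DRAIN_WINDOW_MS;
    conn.notifyPendingClose(offset); // clients can prepare or pre-connect
    setTimeout(() => conn.close(), offset);
  }
}
```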
We deeply apologize for any inconvenience this may have caused.