We experienced two separate incidents on Monday, the 6th of November; both are covered here. The first, which began gradually around 10.01 AM (UTC+1), affected multiple components across the Agent Interface and was caused by the underlying database slowly becoming overloaded.
By 10.15 AM (UTC+1), the database was saturated and the affected components became largely unresponsive, most noticeably with Dashboards failing to load. Engineers immediately convened and concluded that the database had to be scaled up to handle this unprecedented, persistent load. By 10.46 AM (UTC+1) the scaling was complete, and only minutes later, at 10.48 AM (UTC+1), both response times and error rates had stabilized.
The second issue occurred at 12.03 PM (UTC+1), when WebSocket connections were dropped, triggering a high rate of reconnections. The resulting surge in reconnects caused a significant spike in system load and overall slowness lasting roughly 6 minutes, until connections returned to previous levels around 12.09 PM (UTC+1).
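For illustration, a reconnection surge of this kind is typically smoothed on the client side with jittered exponential backoff, so that clients dropped at the same moment do not all reconnect at the same instant. The sketch below is a minimal, hypothetical example; the function and parameter names are assumptions and do not reflect our actual Agent Interface client.

```typescript
// Minimal sketch of client-side reconnection with jittered exponential backoff.
// All names (connectWithRetry, RECONNECT_BASE_MS, etc.) are illustrative only.

const RECONNECT_BASE_MS = 500;   // first retry delay
const RECONNECT_MAX_MS = 30_000; // cap so retries never wait longer than 30s

function reconnectDelay(attempt: number): number {
  // Exponential backoff: 500ms, 1s, 2s, 4s, ... capped at 30s.
  const exponential = Math.min(RECONNECT_BASE_MS * 2 ** attempt, RECONNECT_MAX_MS);
  // Full jitter: pick a random delay in [0, exponential) so that clients
  // dropped at the same moment spread their reconnects over time instead of
  // all hitting the server at once.
  return Math.random() * exponential;
}

function connectWithRetry(url: string, attempt = 0): void {
  const socket = new WebSocket(url);

  socket.onopen = () => {
    attempt = 0; // reset backoff once a connection is established
  };

  socket.onclose = () => {
    const delay = reconnectDelay(attempt);
    setTimeout(() => connectWithRetry(url, attempt + 1), delay);
  };
}
```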
Beyond the immediate actions already taken, which included scaling the database, we will revisit our alerting to ensure we are notified early enough to prevent this kind of prolonged degradation altogether. We are also actively working on migrating the heaviest workloads away from this database to further reduce the load on the system.
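As a sketch of what earlier notification could look like, the example below flags sustained database utilization well before saturation. The metric source, threshold, and window are hypothetical assumptions, not our actual monitoring configuration.

```typescript
// Hypothetical sketch of an early-warning check for database saturation.
// Threshold and window values are illustrative assumptions only.

const UTILIZATION_WARN = 0.7; // warn well below full saturation (1.0)
const SUSTAINED_SAMPLES = 5;  // require several consecutive samples to avoid noise

function shouldAlert(recentUtilization: number[]): boolean {
  // Alert when the last N samples are all above the warning threshold,
  // giving engineers time to scale before the database saturates.
  const window = recentUtilization.slice(-SUSTAINED_SAMPLES);
  return window.length === SUSTAINED_SAMPLES &&
         window.every((u) => u > UTILIZATION_WARN);
}

// Example: a steadily climbing utilization series would trigger the alert.
console.log(shouldAlert([0.55, 0.72, 0.78, 0.81, 0.86, 0.9])); // true
```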
For the WebSocket connection issue, we are looking to make connection closures more predictable, so that their effect on actively working agents is minimized.
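One possible way to make closures more predictable, sketched below purely as an assumption about a general approach, is to spread planned connection closes over a drain window with per-connection jitter and advance notice, instead of dropping many connections at once. The Connection interface and its methods are hypothetical.

```typescript
// Hypothetical sketch: staggering planned WebSocket closures over a drain
// window so that reconnects are spread out rather than arriving all at once.

interface Connection {
  notifyPendingClose(inMs: number): void; // tell the client a close is coming
  close(): void;
}

const DRAIN_WINDOW_MS = 60_000; // spread closures over one minute

function drainConnections(connections: Connection[]): void {
  for (const conn of connections) {
    // Each connection gets its own random offset inside the drain window,
    // so the resulting reconnect load is spread out instead of spiking.
    const offset = Math.random() * DRAIN_WINDOW_MS;
    conn.notifyPendingClose(offset); // clients can prepare or pre-connect
    setTimeout(() => conn.close(), offset);
  }
}
```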
We deeply apologize for any inconvenience this may have caused.