General platform instability

Incident Report for Dixa

Postmortem

Summary:

Between 10:05 and 10:40 CET on 27/01/2022, users could not log in to Dixa. Users who were already connected experienced issues with conversation offers and were eventually logged out.

One of our critical services consumed more resources than it had been allocated, which caused requests to queue and eventually fail.

Alerts detected the events, and a limit increase was applied manually. Customers then began reporting the outage, a few minutes before our own support team was disconnected.

For 17 minutes, between 10:11 and 10:28 CET, all Dixa users could not access the platform.

We received a total of 32 conversations referring to the event.

The incident management process was triggered following the customers' reports.

Our incident manager involved the relevant engineering teams and management at 10:11 CET.

At 10:28 CET, a fix was implemented and the affected services were restarted, after which our engineers detected successful connections.

We have implemented three measures to prevent this from recurring. Firstly, we are working on removing any unlimited resource allocations from all of our services.

Secondly, we allocated more resources to the service to increase stability.

Lastly, we have increased our alerting in this specific area and lowered the thresholds so that we receive alerts at an earlier stage.
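
As an illustration only, the first and third measures can be sketched in Python. This is a minimal sketch under stated assumptions, not a description of Dixa's actual tooling: the `ServiceResources` type, the service names, the numbers, and the 0.7 threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ServiceResources:
    """Hypothetical view of one service's resource settings."""
    name: str
    limit: Optional[float]  # None means no limit is configured (unlimited)
    usage: float            # current usage, in the same unit as the limit


def find_unlimited(services: List[ServiceResources]) -> List[str]:
    """Return the names of services that have no resource limit configured."""
    return [s.name for s in services if s.limit is None]


def should_alert(service: ServiceResources, threshold: float = 0.7) -> bool:
    """Alert when usage reaches the given fraction of the limit.

    A missing or non-positive limit is always flagged, because usage
    cannot be compared against it.
    """
    if service.limit is None or service.limit <= 0:
        return True
    return service.usage / service.limit >= threshold


# Hypothetical example data
services = [
    ServiceResources("auth", limit=100.0, usage=75.0),
    ServiceResources("routing", limit=None, usage=40.0),
    ServiceResources("chat", limit=200.0, usage=50.0),
]

unlimited = find_unlimited(services)
alerting = [s.name for s in services if should_alert(s)]
```

Lowering `threshold` makes `should_alert` fire earlier, which is the effect described for the alerting change. With a threshold of 0.9, the hypothetical `auth` service at 75% usage would not yet raise an alert.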

Timeline (all times are in CET):

10:05. The monitoring system generated the first alerts, and the first users were disconnected.

10:11. Incident management process initiated.

10:12. All users lost connection to the platform.

10:28. Fix implemented.

10:45. Monitoring phase. All users reconnected, and the platform stabilized.

11:10. Event concluded.

Posted Feb 04, 2022 - 16:55 CET

Resolved

All known issues related to this incident have been resolved. We thank you for your patience and cooperation.

A postmortem for this incident will be posted within 5 business days.

Posted Jan 27, 2022 - 11:10 CET

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Jan 27, 2022 - 10:39 CET

Update

We are continuing to work on a fix for this issue.

Posted Jan 27, 2022 - 10:38 CET

Update

We are continuing to work on a fix for this issue.

Posted Jan 27, 2022 - 10:29 CET

Identified

Our team has identified the issue, and we are working on a fix.
We are seeing improvements in the service, and agents are able to log in.
We will continue monitoring.

Next update: 10:45 CET

Posted Jan 27, 2022 - 10:28 CET

Investigating

We have received reports of instability in the platform, affecting access to and service in Dixa. We are investigating the issue.

Next update: 10:45 CET

Posted Jan 27, 2022 - 10:17 CET

This incident affected: Agent Interface (Agent Interface).