Summary:
Between 10:05 and 10:40 CET on 27/01/2022, users could not log in to Dixa. Connected users experienced issues with conversation offers and were eventually logged out.
One of our critical services consumed more resources than it had been allocated, which caused requests to queue and eventually fail.
Alerts detected the events, and a limit increase was manually applied. Customers then began reporting the outage, a few minutes before our own support team was disconnected.
For 17 minutes, between 10:11 and 10:28 CET, no Dixa users could access the platform.
We received a total of 32 conversations referring to the event.
The incident management process was triggered following our clients' reports.
Our incident manager involved the relevant engineering teams and management at 10:11 CET.
At 10:28 CET, a fix was implemented and the affected services were restarted, after which our engineers observed successful connections.
We are taking three measures to prevent this from recurring. Firstly, we are working on removing unlimited resource allocations from all of our services.
Secondly, we allocated more resources to the service to increase stability.
Lastly, we have added alerting in this specific area and lowered the thresholds so that we receive alerts at an earlier stage.
Timeline (all times are in CET):
10:05. The monitoring system generated the first alerts, and the first users were disconnected.
10:11. Incident management process initiated.
10:12. All users lost connection to the platform.
10:28. Fix implemented.
10:45. Monitoring phase. All users reconnected, and the platform stabilized.
11:10. Event concluded.
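The durations quoted in this report can be checked directly against the timeline. The following is a minimal sketch, not part of our tooling: the names TIMELINE, minutes_between and elapsed_since_first_alert are hypothetical, and the entries are copied from the timeline above with shortened labels.

```python
from datetime import datetime

# Timeline entries copied from the report (all times CET, 27/01/2022).
# Labels are shortened for illustration.
TIMELINE = [
    ("10:05", "First alerts; first users disconnected"),
    ("10:11", "Incident management process initiated"),
    ("10:12", "All users lost connection"),
    ("10:28", "Fix implemented"),
    ("10:45", "All users reconnected; platform stabilized"),
    ("11:10", "Event concluded"),
]


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes from start to end (same day, HH:MM format)."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def elapsed_since_first_alert(timeline):
    """Pair each entry with the minutes elapsed since the first entry."""
    first_time = timeline[0][0]
    return [(t, minutes_between(first_time, t), label) for t, label in timeline]


if __name__ == "__main__":
    for time, elapsed, label in elapsed_since_first_alert(TIMELINE):
        print(f"{time} (+{elapsed:>2} min) {label}")
```

For example, minutes_between("10:11", "10:28") yields the 17 minutes stated in the summary.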