Summary:
Between 10:40 and 11:26 CET on 1 June, the agent interface showed errors across several services and prevented some users from logging in.
Following the first customer reports, the incident management process was triggered.
Our incident manager involved the relevant engineering teams and management at 10:44, and an active investigation began.
After further examination, our engineers identified the root cause, which appeared to be related to scheduled maintenance coinciding with a recent deployment.
At 11:07, we deployed a fix and started to see improvements in all services.
We continued actively monitoring until complete recovery.
At 11:26, the incident was resolved.
We have taken action and established a new procedure for future maintenance.
We are genuinely sorry for the disruption.
Timeline (all times are in CET):
10:40 The first customers reported issues.
10:43 We became aware of errors spiking in our alert systems.
10:44 Incident management process triggered. Our status page was updated.
11:07 The problem was identified and a fix was deployed.
11:18 Error rates were back to normal. Status updated to Monitoring.
11:26 The issue was resolved.