What happened? The Redis instance (a database) responsible for authentication stopped working. We were unable to connect to the instance. During the time AWS was having a networking outage in the region where our database is deployed. The inability to connect meant it was impossible to authenticate requests and led to all agent requests failing.
How was it fixed? We had a replica on standby in a different region, once the replica was used everything went back to a working state.
The difficult part was isolating the failure. We have never seen such a failure before and it can be difficult to diagnose new failures when everything seems to be down.
What will we do going forward to stop this from happening again? We did have monitoring that caught the problem, we just didn't know it was the root cause. We will up-prioritize said monitoring. Additionally, we will look into if we can update the configuration to allow for a faster reaction.
Posted Jul 14, 2020 - 16:26 CEST
A fix has been implemented and we are monitoring the results.
Posted Jul 14, 2020 - 15:12 CEST
The issue has been identified and a fix is being implemented.
Posted Jul 14, 2020 - 15:07 CEST
We continue investigating the issue, but we can see that it's the authentication part of the application. Conversations still being received. No data is lost.