We are receiving reports that some customers are seeing alerts in the agent interface
Incident Report for Dixa
Postmortem

Kafka post-mortem 28/07/2020

Introduction

Kafka is a distributed messaging platform and a central piece of infrastructure at Dixa, used to power features based on real-time events such as Webhooks and Analytics. We have incorporated it into the core of most operations in order to send events, and we publish asynchronously so that the outcome of an action is not affected by a failure to publish the event. We do not host the Kafka cluster ourselves; we rely on a third-party global Kafka service provider.

The impact on Dixa’s side

Around 2:10 PM CET, the production Kafka cluster hosted by our provider became inaccessible, which meant that all our producers and consumers of events were not able to process messages. This introduced two problems:

  1. Features that were relying on events (Webhooks, Analytics, Activity Log) stopped working.
  2. Some synchronous message publishing had slipped into core functionality such as responding to offers, claims, and tagging of conversations. Because Kafka was unreachable and the publishing was synchronous, these operations failed entirely.

Note: Dixa has a Kafka staging cluster and a production one. Only the production one was affected.

Immediate solution

For the first problem identified above, we immediately reached out to our provider for assistance in getting the cluster back online. Despite several attempts and dedicated support from our provider, we were unable to restore it. Around 8:00 PM CET, we decided to create a new Kafka cluster from scratch and redeploy the entire platform to use the new cluster.

For the second problem identified above, we identified the issue and fixed it by 3:15 PM CET by making Kafka publishing asynchronous for all affected operations. After the fix, it was once again possible to perform these operations.
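
The difference between the two publishing styles can be illustrated with a minimal Python sketch. It is not Dixa's actual code: `publish`, `BrokerUnavailable`, and the two `tag_conversation_*` functions are hypothetical names, and `publish` simply simulates an unreachable cluster instead of calling a real Kafka client.

```python
import logging
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Tuple

log = logging.getLogger("events")


class BrokerUnavailable(Exception):
    """Raised by the hypothetical publish call when the cluster is unreachable."""


def publish(event: dict) -> None:
    # Hypothetical stand-in for a real Kafka producer call.
    # In this sketch the broker is always down.
    raise BrokerUnavailable("cluster unreachable")


def tag_conversation_sync(conversation: dict, tag: str) -> dict:
    # Synchronous publishing: the broker error propagates to the caller,
    # so the whole operation fails during an outage.
    conversation["tags"].append(tag)
    publish({"type": "tagged", "tag": tag})
    return conversation


_executor = ThreadPoolExecutor(max_workers=2)


def _publish_safely(event: dict) -> bool:
    # Runs on a worker thread; failures are logged, never raised.
    try:
        publish(event)
        return True
    except BrokerUnavailable:
        log.warning("event dropped: %s", event["type"])
        return False


def tag_conversation_async(conversation: dict, tag: str) -> Tuple[dict, Future]:
    # Asynchronous publishing: the operation completes whether or not
    # the event reaches the broker.
    conversation["tags"].append(tag)
    future = _executor.submit(_publish_safely, {"type": "tagged", "tag": tag})
    return conversation, future
```

In the asynchronous variant the event is lost during the outage, which matches the first problem above (event-based features stop working), but the core operation itself still succeeds.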

At 10:45 PM CET, we managed to get everything event-based up and running again.

Longer-term solution

We will investigate the possibility of having a fallback provider and introduce immediate fail-over or even full redundancy if possible. 
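
As a rough illustration of what producer-side fail-over could look like, the sketch below tries a list of providers in order and reports which one accepted the event. It is an assumption-laden example, not a description of a chosen design: `publish_with_failover`, `primary`, and `fallback` are hypothetical, the providers are plain Python callables, and a real setup would also have to handle consumers, offsets, and message ordering.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Publisher = Callable[[Dict], None]


class AllProvidersFailed(Exception):
    """Raised when no configured provider accepted the event."""


def publish_with_failover(
    event: Dict, providers: Sequence[Tuple[str, Publisher]]
) -> str:
    """Try each provider in order; return the name of the one that accepted the event."""
    errors: List[str] = []
    for name, send in providers:
        try:
            send(event)
            return name
        except ConnectionError as exc:
            # Remember the failure and move on to the next provider.
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


def primary(event: Dict) -> None:
    # Hypothetical primary cluster, simulated as unreachable.
    raise ConnectionError("primary cluster unreachable")


fallback_log: List[Dict] = []


def fallback(event: Dict) -> None:
    # Hypothetical fallback cluster, simulated as an in-memory list.
    fallback_log.append(event)
```

Calling `publish_with_failover({"type": "webhook"}, [("primary", primary), ("fallback", fallback)])` would route the event to the fallback in this simulation.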

We already have a list of well-known global providers and have set up meetings for trials, POCs, etc.

Furthermore, in response to the incident, we have enhanced monitoring and alerting for all Kafka instances, and we have made Kafka publishing asynchronous for all affected operations, including future ones.
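
One common shape for this kind of alerting is a periodic reachability check that raises an alert only after several consecutive failures. The sketch below is a generic example under that assumption, not the monitoring actually deployed: `ClusterHealthMonitor`, its `probe` and `alert` callables, and the threshold of 3 are all hypothetical.

```python
from typing import Callable


class ClusterHealthMonitor:
    """Alert once when a probe fails `threshold` times in a row."""

    def __init__(
        self,
        probe: Callable[[], bool],
        alert: Callable[[str], None],
        threshold: int = 3,
    ) -> None:
        self.probe = probe          # returns True if the cluster is reachable
        self.alert = alert          # called with a message when the threshold is hit
        self.threshold = threshold
        self.consecutive_failures = 0
        self.alerting = False

    def check_once(self) -> None:
        if self.probe():
            # Recovery resets the counter and re-arms the alert.
            self.consecutive_failures = 0
            self.alerting = False
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.threshold and not self.alerting:
            self.alerting = True
            self.alert(
                f"Kafka unreachable for {self.consecutive_failures} consecutive checks"
            )
```

Requiring consecutive failures avoids paging on a single transient network error, and the `alerting` flag keeps a long outage from producing a new alert on every check.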

Questions and support

Finally, we are very sorry for the inconvenience caused by the incident and outage, and we have done all in our power to prevent another situation like this in the future. If you have any questions, please contact friends@dixa.com or your Customer Success Manager directly.

Posted Jul 30, 2020 - 14:14 CEST

Resolved
The issue has been fully resolved and the rest of the affected services (Analytics, Webhooks, Activity Log) are back to normal.
Posted Jul 28, 2020 - 23:45 CEST
Update
Analytics, WebHooks, and Activity Log are still affected. It's a cluster maintained by a third-party provider that is having issues. We will post a post-mortem when the issue is completely resolved.
Posted Jul 28, 2020 - 18:01 CEST
Update
We are continuing to monitor for any further issues. Analytics, WebHooks, and Activity Log are still affected, and we are continuing to work on that.
Posted Jul 28, 2020 - 15:32 CEST
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Jul 28, 2020 - 15:20 CEST
Update
We are continuing to work on a fix for this issue.
Posted Jul 28, 2020 - 15:07 CEST
Identified
The issue has been identified and a fix is being implemented.
Posted Jul 28, 2020 - 14:41 CEST
Update
We are continuing to investigate this issue.
Posted Jul 28, 2020 - 14:38 CEST
Investigating
We are currently investigating this issue.
Posted Jul 28, 2020 - 14:25 CEST
This incident affected: Agent Interface (Agent Interface).