Kafka is a distributed messaging platform and a central piece of infrastructure at Dixa, powering features based on real-time events such as Webhooks and Analytics. We have incorporated it into the core of most operations to send events, and we have done so asynchronously, so that the outcome of an action is not impacted by a failure to publish the event. The Kafka cluster is not hosted by us; we rely on a third-party global Kafka service provider.
Around 2:10 PM CET, the production Kafka cluster hosted by our provider became inaccessible, meaning none of our producers and consumers of events could process messages. This introduced two problems:

1. All event-based features, such as Webhooks and Analytics, stopped working.
2. Some operations published their events to Kafka synchronously, so the failed publish prevented those operations from completing at all.
Note: Dixa has a Kafka staging cluster and a production one. Only the production one was affected.
For the first problem identified above, we immediately reached out to our provider for assistance in getting the cluster back online. Despite several attempts and dedicated support from our provider, we did not reach a resolution. Around 8:00 PM CET, a decision was made to create a new Kafka cluster from scratch and redeploy the entire platform to use the new cluster.
For the second problem identified above, we identified the root cause and fixed it by 3:15 PM CET by making Kafka publishing asynchronous for all affected operations. After the fix, it was once again possible to perform these operations.
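To illustrate the idea behind the fix, here is a minimal sketch of fire-and-forget asynchronous publishing: the business operation hands its event to a background worker and completes immediately, and a publish failure is logged rather than propagated to the caller. The class and function names are illustrative assumptions, not Dixa's actual code.

```python
import queue
import threading

class AsyncEventPublisher:
    """Decouples the caller's operation from Kafka availability."""

    def __init__(self, send):
        # `send` is the underlying (possibly failing) produce call.
        self._send = send
        self._queue = queue.Queue()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def publish(self, event):
        # Enqueue and return immediately: the caller's operation
        # succeeds even if the Kafka cluster is unreachable.
        self._queue.put(event)

    def _drain(self):
        while True:
            event = self._queue.get()
            if event is None:  # shutdown sentinel
                break
            try:
                self._send(event)
            except Exception as exc:
                # Outage: log and drop (or retry) the event instead of
                # failing the operation that produced it.
                print(f"publish failed, operation unaffected: {exc}")

    def close(self):
        self._queue.put(None)
        self._worker.join()
```

With this shape, an operation such as creating a conversation calls `publish(...)` and moves on; whether the event ever reaches Kafka is the worker's concern, not the caller's.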
By 10:45 PM CET, we had everything event-based up and running again.
We will investigate the possibility of having a fallback provider and introducing immediate fail-over, or even full redundancy if possible.
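The fail-over approach under consideration could look roughly like the following sketch: try the primary cluster first and, if publishing fails, fall back to a secondary cluster at a different provider. The names and the `send` callables are hypothetical placeholders for real Kafka producer calls.

```python
class FailoverPublisher:
    """Publishes to a primary cluster, falling back to a secondary one."""

    def __init__(self, primary_send, fallback_send):
        self._primary = primary_send
        self._fallback = fallback_send

    def publish(self, event):
        try:
            self._primary(event)
        except Exception:
            # Primary provider is down: publish to the fallback
            # cluster so event-based features keep working.
            self._fallback(event)
```

Full redundancy would go further, e.g. publishing to both clusters and deduplicating on the consumer side, which is one of the trade-offs we will evaluate with candidate providers.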
We already have a list of well-known global providers and have set up meetings for trials, proofs of concept, and similar.
Furthermore, in response to the incident, we have enhanced monitoring and alerting for all our Kafka instances, and we have made Kafka publishing asynchronous for all affected operations, including future ones.
Finally, we are very sorry for the inconvenience caused by this incident and outage, and we have done everything in our power to prevent a situation like this from happening again. If you have any questions, please contact email@example.com or your Customer Success Manager directly.