Partial outage of Agent Interface connections

Incident Report for Dixa

Postmortem

Summary

On April 8, 2026, Dixa experienced a partial outage lasting approximately 58 minutes. New WebSocket connections could not be established, so new logins failed, and agents who refreshed their browser or lost their connection could not reconnect. Agents who remained on an existing session were unaffected.

The root cause was a TLS certificate misconfiguration introduced during a planned migration of our ingress controller infrastructure. The issue was identified, fixed, and fully resolved within the hour.

Impact

Between 18:26 and 19:24 CEST, customers attempting to log in to Dixa or re-establish a WebSocket connection (e.g., after a page reload) were unable to do so. Browsers rejected the connection because an invalid TLS certificate was being served.

Agents who were already logged in with an active WebSocket session continued to operate normally throughout the incident. The impact was limited to new or reconnecting sessions.

A small number of customers were affected and reported the issue to our support team.

No conversations or data were lost during the incident.

Timeline (CEST)

18:26 - Internal reports that Dixa is not loading for some users
18:30 - Issue escalated to engineering via our critical support channel
18:32 - Status page updated to Investigating
18:40 - Engineering identifies a TLS certificate error (ERR_CERT_AUTHORITY_INVALID)
18:52 - Root cause identified: a self-signed default certificate was being served instead of the correct one
19:00 - Status page updated to Identified
19:11 - Fix deployed; WebSocket connections begin recovering
19:12 - Status page updated to Monitoring
19:24 - Full recovery confirmed; status page updated to Resolved
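
The intervals between these milestones can be worked out with a few lines of Python. This is only an illustrative sketch: the timestamps are copied from the timeline above, while the shortened labels and the function name are chosen for this example.

```python
from datetime import datetime

# Timestamps (CEST) copied from the timeline; labels are abbreviated here.
MILESTONES = [
    ("18:26", "first internal reports"),
    ("18:30", "escalated to engineering"),
    ("18:40", "TLS certificate error identified"),
    ("18:52", "root cause identified"),
    ("19:11", "fix deployed"),
    ("19:24", "full recovery confirmed"),
]


def minutes_from_start(milestones):
    """Return (label, minutes elapsed since the first milestone) pairs."""
    start = datetime.strptime(milestones[0][0], "%H:%M")
    elapsed = []
    for clock, label in milestones:
        delta = datetime.strptime(clock, "%H:%M") - start
        elapsed.append((label, int(delta.total_seconds() // 60)))
    return elapsed


for label, minutes in minutes_from_start(MILESTONES):
    print(f"+{minutes:>2} min  {label}")
```

The last entry comes out at 58 minutes, which matches the duration given in the summary.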

Root Cause

As part of our ongoing WebSocket resilience work to ensure a more stable platform, we migrated our ingress controller (the component that routes incoming traffic to internal services) from an end-of-support solution to a new one. This migration was tested in our staging environment before being applied to production.

However, there was a configuration discrepancy between staging and production for the ingress class that handles WebSocket traffic. When DNS was switched to the new ingress controller on the morning of April 8, existing connections continued to work through cached DNS entries still pointing to the old controller. Hours later, as DNS caches expired across the internet, clients began resolving to the new controller, which, due to the misconfiguration, did not recognize the WebSocket routes. This caused it to serve a default self-signed TLS certificate instead of the valid one, leading browsers to reject the connection.

The length of the partial outage varied per customer, depending on when their DNS cache expired and their WebSocket connections started resolving to the new ingress controller.
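
The failure mode described above can be illustrated with a small, simplified model. It is not the actual controller or its configuration: the class names, hostname, and certificate labels are hypothetical, and the model assumes the discrepancy took the form of an ingress class value that the new controller was not watching, which is one plausible way such a mismatch plays out.

```python
from dataclasses import dataclass
from typing import Dict, List

# Placeholder label for the controller's built-in fallback certificate.
DEFAULT_CERT = "self-signed default certificate"


@dataclass(frozen=True)
class IngressRule:
    host: str           # hostname the rule serves
    ingress_class: str  # which controller is expected to pick the rule up
    certificate: str    # label of the certificate configured for the host


class IngressController:
    """Toy controller that only loads rules matching its own class."""

    def __init__(self, watched_class: str, rules: List[IngressRule]) -> None:
        # Rules with a different class are ignored, without any error.
        self._certs: Dict[str, str] = {
            rule.host: rule.certificate
            for rule in rules
            if rule.ingress_class == watched_class
        }

    def certificate_for(self, requested_host: str) -> str:
        # Unknown hosts fall back to the default certificate.
        return self._certs.get(requested_host, DEFAULT_CERT)


# Hypothetical values: staging matches the controller, production does not.
staging = IngressController(
    "ws-new", [IngressRule("ws.example.test", "ws-new", "valid certificate")]
)
production = IngressController(
    "ws-new", [IngressRule("ws.example.test", "ws-legacy", "valid certificate")]
)

print(staging.certificate_for("ws.example.test"))
print(production.certificate_for("ws.example.test"))
```

In this model the staging controller returns the configured certificate, while the production controller never loads the rule and silently falls back to its default. That is the kind of gap a staging test cannot reveal when the two environments are configured differently.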

Resolution

Once the root cause was identified, a configuration update was deployed to the new ingress controller so that it could correctly handle WebSocket traffic. Connections began recovering immediately after the fix was applied.

Preventive Measures

We have taken the following steps to reduce the likelihood and impact of similar issues in the future:

Environment parity: All non-production environments have been aligned with production configuration conventions, eliminating the discrepancy that caused this incident.
Endpoint monitoring: We have added external monitoring checks that validate both the availability and TLS certificate validity of our WebSocket endpoints. This will enable faster detection if a similar issue occurs.
WebSocket isolation: We will make the platform's dependency on WebSocket connections optional, so that a similar issue in the future is less disruptive for users.
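
As an illustration of the endpoint monitoring measure, an external check of this kind could look roughly like the sketch below. It is not Dixa's actual monitoring: the function names, the 14-day expiry threshold, and the timeout are assumptions made for the example. It relies only on Python's standard `ssl` and `socket` modules, and it verifies the certificate chain and hostname the same way a browser would.

```python
import socket
import ssl
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class TlsCheckResult:
    ok: bool
    detail: str
    days_left: Optional[float] = None


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days between `now` (epoch seconds) and a certificate's notAfter value."""
    expires = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires - current) / 86400.0


def check_tls(host: str, port: int = 443, timeout: float = 5.0,
              min_days: float = 14.0) -> TlsCheckResult:
    """Open a verified TLS connection and report on the served certificate."""
    context = ssl.create_default_context()  # verifies chain and hostname
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                cert = tls.getpeercert()
    except ssl.SSLCertVerificationError as exc:
        # A self-signed default certificate is rejected here.
        return TlsCheckResult(False, f"certificate rejected: {exc.verify_message}")
    except OSError as exc:
        # Covers DNS failures, refused connections, timeouts, other TLS errors.
        return TlsCheckResult(False, f"connection failed: {exc}")

    days = days_until_expiry(cert["notAfter"])
    if days < min_days:
        return TlsCheckResult(False, "certificate expires soon", days)
    return TlsCheckResult(True, "certificate valid", days)
```

A scheduler could call `check_tls` with the WebSocket hostname every minute or so and raise an alert whenever `ok` is false. Running the check from outside the cluster matters, because that way it resolves DNS and sees the certificate exactly as customers' browsers do.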

We sincerely apologize for the disruption this caused. Reliability is a top priority for us, and we are committed to learning from every incident to make Dixa more resilient. If you have any questions, please don't hesitate to reach out to your account team or our support at friends@dixa.com.

Posted Apr 10, 2026 - 14:22 CEST

Resolved

This incident has now been resolved.

Our sincere apologies for the disruption.

We identified the cause as a configuration issue on a new entry point for agent connections to the Agent Interface, which was introduced earlier today.

More information will be shared in the postmortem, which will be posted here within 5 business days.

If you have questions about today's outage, feel free to reach out to friends@dixa.com. We'll be happy to help.

Posted Apr 08, 2026 - 19:24 CEST

Monitoring

We are happy to report that our teams have deployed a fix for the issue.

We are seeing agents successfully reconnect to the Agent Interface. We will continue to monitor the results.

Next update at 17:30 UTC (19:30 CEST)

Posted Apr 08, 2026 - 19:12 CEST

Identified

We've identified the cause of the issue and are actively working towards a solution.

We have also confirmed that this only impacts connections to the Agent Interface. Other parts of Dixa are not impacted by this partial outage.

We'll report back at 17:15 UTC (19:15 CEST), or earlier if we have more information to share.

Posted Apr 08, 2026 - 19:00 CEST

Update

We're still trying to find the cause. We'll update you again in 15 minutes, or sooner if we have identified the problem.

Posted Apr 08, 2026 - 18:50 CEST

Update

We are continuing to investigate this issue.

Posted Apr 08, 2026 - 18:37 CEST

Investigating

We've received reports of Dixa not loading. We're investigating and will come back to you with an update as soon as we know more.

Next update at 16:45 UTC (18:45 CEST)

Posted Apr 08, 2026 - 18:32 CEST

This incident affected: Agent Interface (Agent Interface).