Degraded performance

Incident Report for Dixa

Postmortem

Summary: On March 11, 2026, Dixa experienced platform-wide degraded performance lasting approximately four hours (08:30 - 12:35 CET). Customers experienced slow or failed conversation loading, timeouts on email sending, conversation transfers, assignments, and flow processing. No data was lost, and there were no security issues at any point.

Impact:

  • Availability: Platform-wide slowness and partial inaccessibility for ~4 hours.
  • Affected functionality: Conversation loading, email sending, conversation transfers, conversation assignments, and flow processing - all experienced significant slowness and intermittent failures.
  • Data integrity: All emails were fully processed after the fix. No data was lost, and no security issues occurred at any point.

Root Cause: The incident was caused by an atypical traffic pattern in inbound email processing that triggered repeated internal retries. Retries are a normal part of email distribution, accounting for factors such as sending delays and server availability, but in this case they compounded exponentially. The sustained retry volume placed excessive load on a central platform component, causing cascading timeouts across dependent services and resulting in platform-wide degradation.
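To illustrate how normally benign retries can compound, here is a minimal arithmetic sketch (our own illustration, not Dixa's code): when each layer of a pipeline independently retries a failing call, the worst-case attempt count grows exponentially with the number of layers.

```python
# Hypothetical sketch: with L pipeline layers each retrying R times,
# one failing message can fan out into (R + 1) ** L downstream attempts.

def worst_case_attempts(retries_per_layer: int, layers: int) -> int:
    """Worst-case downstream attempts triggered by one failing message."""
    return (retries_per_layer + 1) ** layers

# Three layers, each retrying four times: a single bad email can
# generate up to 125 attempts against a shared component.
print(worst_case_attempts(4, 3))  # 125
```

This is why a modest spike of unprocessable messages can translate into a disproportionate load on a central component.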

Timeline (CET):

  • Mar 11, 06:00 - First signs of email processing errors detected
  • Mar 11, 08:30 - Platform degradation begins; customer impact starts
  • Mar 11, 10:50 - First mitigation deployed; partial improvement
  • Mar 11, 12:30 - Root cause fully identified; final fix applied
  • Mar 11, 12:35 - Platform stability confirmed

Resolution: We identified and addressed the source of the abnormal email volume, which immediately reduced error rates and allowed the platform to recover.

What We Have Done Since This Incident: We have already implemented the following improvements:

  1. Added validation to reject invalid email addresses early in the pipeline, preventing them from entering retry loops.
  2. Optimised internal lookups to fetch only necessary data instead of the full conversation history, significantly reducing load during email processing.
  3. Added deduplication logic to prevent redundant data fetches during email processing.
  4. Enforced concurrency limits: platform components now shed excess traffic when saturated, allowing requests to be redistributed rather than queued indefinitely.
  5. Added deadline checking: expired requests are now discarded immediately instead of consuming resources on work that is no longer needed.
  6. Reduced internal timeout thresholds to fail fast under contention rather than blocking for extended periods.
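Improvements 4 and 5 can be sketched together. The following is a minimal illustration under our own assumptions (class and method names are hypothetical, not Dixa's implementation): a bounded worker sheds excess traffic instead of queueing it indefinitely, and discards requests whose deadline has already passed.

```python
import threading
import time

# Hypothetical sketch of concurrency limiting (load shedding) plus
# deadline checking. Names are illustrative, not Dixa's actual code.

class BoundedProcessor:
    def __init__(self, max_concurrent: int):
        # Non-blocking semaphore acquisition implements load shedding:
        # when all slots are busy, new work is rejected, not queued.
        self._slots = threading.Semaphore(max_concurrent)

    def handle(self, deadline: float, work) -> str:
        if time.monotonic() > deadline:
            return "expired"      # deadline check: drop stale work immediately
        if not self._slots.acquire(blocking=False):
            return "shed"         # saturated: reject so callers can redistribute
        try:
            work()
            return "ok"
        finally:
            self._slots.release()

proc = BoundedProcessor(max_concurrent=2)
print(proc.handle(time.monotonic() + 1.0, lambda: None))  # "ok"
print(proc.handle(time.monotonic() - 1.0, lambda: None))  # "expired"
```

Shedding rather than queueing keeps a saturated component from accumulating a backlog of requests that will have timed out by the time they are served.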

What We're Continuing to Work On:

  1. Loop detection and interruption - Introduce mechanisms to detect and automatically halt email processing anomalies before they can accumulate significant load.
  2. Improved alerting and escalation - Ensure processing anomalies are detected and escalated with appropriate urgency.

Closing Note: We sincerely apologize for the disruption this caused. These improvements are our highest priority. If you have any questions, please reach out to friends@dixa.com.

Posted Mar 13, 2026 - 13:27 CET

Resolved

All known issues linked to this incident have been fixed and full recovery of the platform has been confirmed.
We thank you for your patience and cooperation.

A postmortem will be published within 5 business days.
Posted Mar 11, 2026 - 12:31 CET

Monitoring

A fix has been deployed and we are observing improvements across system metrics. Our team will continue to actively monitor the situation until full recovery is confirmed.

Next update: 12:30 CET
Posted Mar 11, 2026 - 11:54 CET

Update

We continue to actively work on resolving this incident with the highest priority. Our team remains fully engaged and further updates will follow as our investigation progresses.

Next update: 12:00 CET
Posted Mar 11, 2026 - 11:35 CET

Identified

We have identified some additional disruptions in the service. Our team is actively working on a fix.

Next update: 11:30 CET
Posted Mar 11, 2026 - 10:59 CET

Update

We continue to observe improvements. We can confirm there was no data loss. Please note that as a side effect, some conversations may have been routed to the default queue. We apologize for any inconvenience caused.

Next update: 11:00 CET
Posted Mar 11, 2026 - 10:31 CET

Update

We are continuing to monitor system metrics and are observing a recovering trend.

Next update: 10:30 CET
Posted Mar 11, 2026 - 10:14 CET

Monitoring

Our team has now identified the issue and is working on a fix. We will provide more information soon.

Next update at 10:10 CET
Posted Mar 11, 2026 - 09:54 CET

Update

We have received reports of instability in the platform. We are investigating the issue. Updates will follow.
Posted Mar 11, 2026 - 09:51 CET

Update

We are receiving reports of slowness in the agent interface. We are investigating the issue.

Next update at 09:50 CET
Posted Mar 11, 2026 - 09:34 CET

Investigating

We have received reports of instability in the platform. We are investigating the issue. Updates will follow.
Posted Mar 11, 2026 - 09:18 CET
This incident affected: Agent Interface.