Delays in the Dashboard, Analytics, and Offers for some queues
Incident Report for Dixa
Postmortem

Partial Service Disruption at Dixa (October 1, 2024 - October 3, 2024)

Incident Overview

Dixa experienced a partial service disruption from October 1, 2024, 13:07 CEST to October 3, 2024, 20:20 CEST. Conversations were initially routed to incorrect queues, and some queues later became blocked, preventing new conversations from being offered. Correcting the misrouted conversations required manual intervention, which delayed conversation ingestion.

Timeline

  • October 1, 2024, 13:07 CEST: Incident begins. Conversations start landing in incorrect queues.
  • October 1, 2024, 14:30 CEST: The issue is mitigated, and conversations no longer land in incorrect queues.
  • October 1, 2024, 15:00 CEST: Impact increases as some queues become blocked, preventing new conversations from being offered.
  • October 2, 2024, 18:00 CEST: Hot-fixes are implemented, restoring queue routing functionality and unblocking queues.
  • October 3, 2024, 20:20 CEST: Full service is restored.

Root Cause

The disruption was caused by a change in the encoding protocol for the queue cache, which led to incorrect routing of conversations. Although the misrouting was quickly mitigated, queues became blocked later in the day, preventing new conversations from being offered. No data inaccuracies occurred; however, manual intervention was needed to correct the misrouted conversations, which delayed conversation ingestion on October 1, 2024.
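
As an illustration of how a cache encoding change can be made safer, the sketch below tags every cache entry with an encoding version and treats anything it does not recognize as a cache miss instead of decoding it incorrectly. It is a minimal sketch under stated assumptions, not Dixa's actual implementation: the JSON envelope, the field names, the version number, and the functions `encode_entry` and `decode_entry` are all hypothetical.

```python
import json
from typing import List, Optional

# Hypothetical version number for the new cache encoding.
CURRENT_VERSION = 2


def encode_entry(queue_id: str, conversation_ids: List[str]) -> bytes:
    """Wrap the payload in an envelope that names its encoding version."""
    envelope = {
        "v": CURRENT_VERSION,
        "queue_id": queue_id,
        "conversations": conversation_ids,
    }
    return json.dumps(envelope).encode("utf-8")


def decode_entry(raw: bytes) -> Optional[dict]:
    """Decode an entry, or return None so the caller treats it as a cache miss.

    Entries that are malformed, written with another version, or missing
    fields are rejected rather than interpreted with the wrong layout.
    """
    try:
        envelope = json.loads(raw.decode("utf-8"))
    except ValueError:  # covers UnicodeDecodeError and json.JSONDecodeError
        return None
    if not isinstance(envelope, dict) or envelope.get("v") != CURRENT_VERSION:
        return None
    if "queue_id" not in envelope or "conversations" not in envelope:
        return None
    return {
        "queue_id": envelope["queue_id"],
        "conversations": envelope["conversations"],
    }
```

With this approach, a reader that meets an entry written under a different encoding falls back to the source of truth instead of routing on misread data, at the cost of extra cache misses while a rollout is in progress.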

Impact

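  • Conversations were routed to incorrect queues between 13:07 and 14:30 CEST on October 1, 2024.
  • Some queues were blocked, preventing new conversations from being offered.
  • Conversation ingestion was delayed because manual intervention was needed to correct the misrouting.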

Resolution

  • Hot-fixes were applied to the queue and offer services on October 2, 2024, restoring routing functionality and unblocking queues.
  • Full service was restored by October 3, 2024.

Aftermath and Long-term Actions

To avoid future incidents, the following actions will be taken:

  1. Protocol Review: Review and refine the encoding protocol to ensure stable queue operations.
  2. Queue Service Improvements: Implement enhancements to prevent future misrouting and the need for manual intervention.
  3. Enhanced Monitoring: Improve monitoring systems to detect routing issues earlier.
  4. Post-Incident Review: Conduct a retrospective to gather insights and improve overall system resilience.
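
As one possible shape for the enhanced monitoring in item 3, the sketch below flags queues that have conversations waiting but have not offered any for longer than a threshold, which is the symptom described in this incident. It is a hedged illustration only: the `QueueStats` fields, the `find_blocked_queues` function, and the 300-second default are assumptions, not part of Dixa's monitoring stack.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class QueueStats:
    """Hypothetical snapshot of one queue, as a metrics system might report it."""
    queue_id: str
    waiting: int                     # conversations waiting to be offered
    seconds_since_last_offer: float  # time since the queue last offered one


def find_blocked_queues(
    stats: Iterable[QueueStats],
    max_idle_seconds: float = 300.0,  # assumed threshold; tune per queue
) -> List[str]:
    """Return IDs of queues that have waiting work but no recent offers."""
    return [
        s.queue_id
        for s in stats
        if s.waiting > 0 and s.seconds_since_last_offer > max_idle_seconds
    ]
```

An empty queue that has been idle for a long time is not flagged, since a lack of offers is only suspicious when conversations are waiting.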

Conclusion

This incident highlighted areas for improvement in our queue management system, particularly in handling protocol changes. While the immediate issue was resolved with hot-fixes, we are taking proactive steps to enhance the system’s resilience and ensure smooth conversation routing and ingestion moving forward.


Dixa Incident Management Team
Date: October 4, 2024

Posted Oct 04, 2024 - 15:09 CEST

Resolved
This incident has been resolved.
Posted Oct 02, 2024 - 20:20 CEST
Update
We’re pleased to report that the delays affecting the Dashboard and Analytics have been resolved, and they are now functioning as expected. However, we are still experiencing issues with Queues and Offers.

We apologize for the continued inconvenience and appreciate your patience as we work to resolve the remaining issues. Further updates will be provided as soon as more information is available.

Thank you for your understanding.

Next Update: Oct 2, 11:00 AM UTC
Posted Oct 02, 2024 - 10:25 CEST
Update
Our team is continuing to work on resolving the delays affecting the Dashboard and Analytics. While we are making progress, the issue is not yet fully resolved.

We sincerely appreciate your patience as we work to restore normal service. Further updates will be provided as we have more information.

Thank you for your continued understanding.

Next Update: Oct 2, 2024 09:00 UTC
Posted Oct 02, 2024 - 09:38 CEST
Update
Our team is still actively working to resolve the ongoing delays in the Dashboard and Analytics sections. While progress has been made, we have not yet fully restored functionality. We are continuing to investigate and implement fixes.

We understand the inconvenience this may be causing and appreciate your continued patience. We will provide further updates as soon as more information is available.

Thank you for your understanding.
Posted Oct 01, 2024 - 15:29 CEST
Update
We are continuing to work on a fix for this issue.
Posted Oct 01, 2024 - 14:20 CEST
Update
We are continuing to work on a fix for this issue.
Posted Oct 01, 2024 - 14:18 CEST
Update
We are continuing to work on a fix for this issue.
Posted Oct 01, 2024 - 14:15 CEST
Identified
We are currently experiencing delays in the Dashboard and Analytics sections of our platform. Our team is actively investigating the issue, and we are working to restore full functionality as quickly as possible.

We apologize for the inconvenience and appreciate your patience. Further updates will be posted here as soon as they are available.

Thank you for your understanding.

Next Update: 13:00 UTC
Posted Oct 01, 2024 - 14:12 CEST
This incident affected: Agent Interface (Analytics, Conversation Offers (WSS connection), Dashboard).