Partial Service Disruption at Dixa (October 1, 2024 - October 3, 2024)
Incident Overview
Dixa experienced a partial service disruption from October 1, 2024, 13:07 CEST to October 3, 2024, 20:20 CEST. Conversations were routed to incorrect queues, and some queues later became blocked. The disruption primarily affected our queue management system, and the manual intervention needed to correct the routing delayed conversation ingestion.
Timeline
- October 1, 2024, 13:07 CEST: Incident begins. Conversations start landing in incorrect queues.
- October 1, 2024, 14:30 CEST: The misrouting is mitigated, and conversations no longer land in incorrect queues.
- October 1, 2024, 15:00 CEST: Impact increases as some queues become blocked, preventing new conversations from being offered.
- October 2, 2024, 18:00 CEST: Hot-fixes are implemented, restoring queue routing functionality and unblocking queues.
- October 3, 2024, 20:20 CEST: Full service is restored.
Root Cause
The disruption was caused by a change in the encoding protocol for the queue cache, which led to conversations being routed to incorrect queues. The misrouting was quickly mitigated, but later the same day some queues became blocked, preventing new conversations from being offered. There were no data inaccuracies. However, correcting the misrouted conversations required manual intervention, which delayed conversation ingestion on October 1, 2024.
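To illustrate the class of failure described above, the following is a minimal sketch of one common safeguard: tagging each cache entry with an encoding version so that a reader rejects entries it does not understand instead of misinterpreting them. It is not Dixa's implementation. The JSON format, the field names, the version number, and the function names are all assumptions made for illustration.

```python
import json

# Hypothetical version number for the cache entry encoding.
CURRENT_VERSION = 2


class UnsupportedEncoding(Exception):
    """Raised when a cache entry cannot be safely decoded."""


def encode_entry(conversation_id: str, queue_id: str) -> bytes:
    """Serialize a routing entry with an explicit encoding version."""
    payload = {
        "v": CURRENT_VERSION,
        "conversation_id": conversation_id,
        "queue_id": queue_id,
    }
    return json.dumps(payload).encode("utf-8")


def decode_entry(raw: bytes) -> dict:
    """Decode a routing entry, refusing anything not written in the current encoding."""
    try:
        payload = json.loads(raw.decode("utf-8"))
    except ValueError as exc:
        # Covers both invalid UTF-8 and invalid JSON.
        raise UnsupportedEncoding("cache entry is not valid JSON") from exc
    if not isinstance(payload, dict) or payload.get("v") != CURRENT_VERSION:
        # Fail loudly rather than guess at the meaning of the fields.
        raise UnsupportedEncoding("cache entry has an unknown encoding version")
    return payload
```

With a check like this, a caller can treat `UnsupportedEncoding` as a cache miss and fall back to the source of truth, rather than routing on fields decoded under the wrong protocol.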
Impact
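- Conversations were routed to incorrect queues on October 1, 2024, between 13:07 and 14:30 CEST.
- Some queues were blocked, preventing new conversations from being offered.
- Conversation ingestion was delayed because misrouted conversations had to be corrected manually.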
Resolution
- Hot-fixes were applied to the queue and offer services on October 2, 2024, restoring routing functionality and unblocking queues.
- Full service was restored on October 3, 2024, at 20:20 CEST.
Aftermath and Long-term Actions
To reduce the risk of similar incidents, the following actions will be taken:
- Protocol Review: Review and refine the encoding protocol to ensure stable queue operations.
- Queue Service Improvements: Implement enhancements to prevent future misrouting and the need for manual intervention.
- Enhanced Monitoring: Improve monitoring systems to detect routing issues earlier.
- Post-Incident Review: Conduct a retrospective to gather insights and improve overall system resilience.
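As an illustration of the Enhanced Monitoring item above, the sketch below tracks the share of recent conversations that landed in a queue other than the expected one and signals when that share exceeds a threshold. It is a hypothetical example, not a description of Dixa's monitoring. The class name, the window size, and the threshold are assumptions, and it presumes the expected queue can be computed independently of the cache.

```python
from collections import deque


class MisrouteMonitor:
    """Tracks the misrouting rate over a sliding window of recent conversations."""

    def __init__(self, window_size: int = 100, threshold: float = 0.05) -> None:
        # Each element is True if the conversation was misrouted.
        self.events = deque(maxlen=window_size)
        self.threshold = threshold

    def record(self, expected_queue: str, actual_queue: str) -> None:
        """Record one routing decision."""
        self.events.append(expected_queue != actual_queue)

    def misroute_rate(self) -> float:
        """Return the fraction of misrouted conversations in the window."""
        if not self.events:
            return 0.0
        return sum(self.events) / len(self.events)

    def should_alert(self) -> bool:
        """Return True when the misrouting rate exceeds the threshold."""
        return self.misroute_rate() > self.threshold
```

A sliding window keeps the signal responsive: a burst of misrouted conversations raises the rate within a few events, and the rate falls again once routing recovers.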
Conclusion
This incident highlighted areas for improvement in our queue management system, particularly in handling protocol changes. While the immediate issue was resolved with hot-fixes, we are taking proactive steps to enhance the system’s resilience and ensure smooth conversation routing and ingestion moving forward.
Dixa Incident Management Team
Date: October 4, 2024