Date: August 22, 2025
Duration: 5 days (August 22-27, 2025)
Severity: Degraded
Summary
Between August 22 and 27, 2025, our AI assistant (Mim) returned lower-quality responses because knowledge base indexing failed intermittently. Our OpenSearch cluster rejected document index requests because of memory constraints caused by oversized embeddings. Mim continued to operate on partial knowledge data, so customers saw reduced response accuracy for 5 days until we resolved the issue by scaling the cluster.
Timeline
- August 22, 2025: First indexing failures occurred
- August 22-25, 2025: Indexing failures grew in frequency
- August 25, 2025: Issue escalated by CS team
- August 27, 2025 - 18:00 CEST: Indexing failures stopped, and new knowledge bases could be indexed successfully
- August 28, 2025: Existing Dixa Knowledge and elevio data sources re-indexed
- August 28, 2025 - 12:00: Full service restoration achieved
Root Cause
The OpenSearch cluster rejected knowledge base indexing requests because it did not have enough memory for the volume and size of the embeddings being indexed. Contributing factors included:
- Inadequate system capacity for the current workload demands
- Inefficient resource utilization patterns
- Suboptimal data processing architecture that doesn't scale effectively with growth
- Large data structures requiring more system resources than available
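To illustrate how embedding size drives memory demand, the following is a minimal sketch of the raw arithmetic. It assumes embeddings are stored as 32-bit floats and ignores index overhead. The function name, vector count, and dimensions are hypothetical and are not figures from this incident.

```python
def embedding_memory_bytes(num_vectors: int, dimensions: int,
                           bytes_per_value: int = 4) -> int:
    """Raw size of the vector data alone.

    Assumes float32 values (4 bytes each) and excludes any
    index structure overhead, so real usage will be higher.
    """
    return num_vectors * dimensions * bytes_per_value


# Hypothetical workload: one million chunks at two embedding sizes.
large = embedding_memory_bytes(1_000_000, 1536)
small = embedding_memory_bytes(1_000_000, 384)

print(f"1536 dimensions: {large / 2**30:.2f} GiB")
print(f" 384 dimensions: {small / 2**30:.2f} GiB")
print(f"Reduction factor: {large / small:.0f}x")
```

Because the raw size scales linearly with the number of dimensions, a smaller embedding reduces memory demand by the same factor, independent of how many documents are indexed.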
Impact
- Duration: 5 days (August 22-27, 2025)
- Service Level: Degraded (not complete outage)
- User Experience: Mim responses had reduced accuracy and completeness due to operating with partial knowledge data
Resolution
Immediate Actions Taken
- System Cleanup: Removed unused data and optimized storage utilization
- Capacity Scaling: Scaled the OpenSearch cluster to increase the computational resources available for the indexing workload
- Performance Optimization: Balanced system performance improvements with operational efficiency
Long-term Improvements Planned
- Architecture Enhancement:
* Redesign data processing workflows for improved scalability
* Implement more efficient system resource management
- Data Optimization:
* Evaluate opportunities to reduce data processing overhead
* Optimize data formats for better system performance
- Monitoring Enhancement:
* Implement proactive alerting for system capacity issues
* Add performance monitoring to prevent resource bottlenecks
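As one possible shape for the proactive alerting described above, the following sketch checks a node statistics payload for high heap usage and newly rejected write requests. It assumes the payload follows the layout returned by the OpenSearch nodes stats API (`jvm.mem.heap_used_percent` and `thread_pool.write.rejected` per node). The function name, the 85% threshold, and the sample data are hypothetical and do not describe our production setup.

```python
from __future__ import annotations

from typing import Any

HEAP_ALERT_PERCENT = 85  # hypothetical threshold, tune per cluster


def capacity_alerts(node_stats: dict[str, Any],
                    previous_rejected: dict[str, int],
                    heap_threshold: int = HEAP_ALERT_PERCENT) -> list[str]:
    """Return human-readable alerts for nodes under memory pressure.

    node_stats: parsed JSON in the nodes stats layout (assumed).
    previous_rejected: cumulative write rejections per node id from
        the previous poll, used to detect new rejections only.
    """
    alerts: list[str] = []
    for node_id, node in node_stats.get("nodes", {}).items():
        name = node.get("name", node_id)

        heap = node.get("jvm", {}).get("mem", {}).get("heap_used_percent")
        if heap is not None and heap >= heap_threshold:
            alerts.append(f"{name}: heap at {heap}% (threshold {heap_threshold}%)")

        # The rejected counter is cumulative, so compare with the last poll.
        rejected = node.get("thread_pool", {}).get("write", {}).get("rejected", 0)
        delta = rejected - previous_rejected.get(node_id, 0)
        if delta > 0:
            alerts.append(f"{name}: {delta} new write rejections")
    return alerts


# Hypothetical sample payload for illustration only.
sample = {
    "nodes": {
        "n1": {"name": "data-1",
               "jvm": {"mem": {"heap_used_percent": 91}},
               "thread_pool": {"write": {"rejected": 12}}},
        "n2": {"name": "data-2",
               "jvm": {"mem": {"heap_used_percent": 40}},
               "thread_pool": {"write": {"rejected": 0}}},
    }
}

for alert in capacity_alerts(sample, previous_rejected={"n1": 5}):
    print(alert)
```

Alerting on the change in rejections rather than the cumulative total means a check like this could flag the first indexing failures on the day they start, instead of relying on escalation from the CS team.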