Mim Data Source Sync Failures

Incident Report for Dixa

Postmortem

Date: August 22, 2025
Duration: 5 days (August 22-27, 2025)
Severity: Degraded

Summary

Between August 22 and 27, 2025, our AI assistant (Mim) delivered degraded response quality because knowledge base indexing failed intermittently. Our OpenSearch cluster rejected document index requests because oversized embeddings pushed it up against its memory limits. Mim continued to operate with partial knowledge data, but customers experienced reduced response accuracy for 5 days until we resolved the issue by scaling the cluster.

Timeline

  • August 22, 2025: First indexing failures occur
  • August 22-25, 2025: Indexing failures grow in frequency
  • August 25, 2025: Issue escalated by CS team
  • August 27, 2025 - 18:00 CEST: Indexing failures stop, and new knowledge bases can be indexed successfully
  • August 28, 2025: Existing Dixa Knowledge and elevio data sources re-indexed
  • August 28, 2025 - 12:00: Full service restoration achieved

Root Cause

The OpenSearch cluster could not process knowledge base indexing requests because it lacked the memory needed for the volume and size of the data being indexed. Contributing factors included:

  • Inadequate system capacity for the current workload demands
  • Inefficient resource utilization patterns
  • Suboptimal data processing architecture that doesn't scale effectively with growth
  • Large data structures (oversized embeddings) requiring more memory than was available
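
To make the last factor concrete, the following is a rough sizing sketch rather than a description of our actual setup. It assumes the embeddings are stored in an OpenSearch k-NN index using HNSW, which this report does not state, and it uses the rule-of-thumb memory estimate from the OpenSearch k-NN documentation. The vector count, the dimensions, and the `m` value are illustrative numbers only.

```python
def hnsw_memory_bytes(num_vectors: int, dimension: int, m: int = 16) -> float:
    """Rule-of-thumb native memory estimate for an HNSW k-NN graph.

    Applies the estimate given in the OpenSearch k-NN documentation:
    1.1 * (4 * dimension + 8 * m) bytes per vector.
    """
    return 1.1 * (4 * dimension + 8 * m) * num_vectors


def to_gib(num_bytes: float) -> float:
    """Convert a byte count to GiB."""
    return num_bytes / (1024 ** 3)


# Illustrative only: one million embedded chunks at two embedding sizes.
small = hnsw_memory_bytes(1_000_000, dimension=384)
large = hnsw_memory_bytes(1_000_000, dimension=1536)

print(f"384 dimensions:  {to_gib(small):.2f} GiB")
print(f"1536 dimensions: {to_gib(large):.2f} GiB")
```

Under these assumptions, quadrupling the embedding dimension nearly quadruples the memory needed for the same number of documents, which is why embedding size matters as much as document count when sizing a cluster.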

Impact

  • Duration: 5 days (August 22-27, 2025)
  • Service Level: Degraded (not complete outage)
  • User Experience: Mim responses had reduced accuracy and completeness due to operating with partial knowledge data

Resolution

Immediate Actions Taken

  1. System Cleanup: Removed unused data and optimized storage utilization
  2. Capacity Scaling: Increased available computational resources to handle workload demands
  3. Performance Optimization: Balanced system performance improvements with operational efficiency

Long-term Improvements Planned

  1. Architecture Enhancement:
       • Redesign data processing workflows for improved scalability
       • Implement more efficient system resource management
  2. Data Optimization:
       • Evaluate opportunities to reduce data processing overhead
       • Optimize data formats for better system performance
  3. Monitoring Enhancement:
       • Implement proactive alerting for system capacity issues
       • Add performance monitoring to prevent resource bottlenecks
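
As an illustration of the proactive alerting item, here is a minimal sketch, not our production monitoring. It assumes the input is the parsed JSON body of an OpenSearch `GET _nodes/stats/jvm,thread_pool` response. The function name `capacity_alerts`, the 85% heap threshold, and the sample node data are hypothetical.

```python
from typing import Any

HEAP_WARN_PERCENT = 85  # hypothetical threshold, tune per cluster


def capacity_alerts(node_stats: dict[str, Any],
                    previous_rejections: dict[str, int],
                    heap_warn_percent: int = HEAP_WARN_PERCENT) -> list[str]:
    """Return human-readable alerts for nodes that look close to capacity."""
    alerts = []
    for node_id, node in node_stats.get("nodes", {}).items():
        name = node.get("name", node_id)

        heap = node["jvm"]["mem"]["heap_used_percent"]
        if heap >= heap_warn_percent:
            alerts.append(f"{name}: JVM heap at {heap}%")

        # The rejected counter is cumulative, so alert only on growth
        # since the previous poll.
        rejected = node["thread_pool"]["write"]["rejected"]
        new_rejections = rejected - previous_rejections.get(node_id, 0)
        if new_rejections > 0:
            alerts.append(f"{name}: {new_rejections} new write rejections")
    return alerts


# Made-up sample shaped like a nodes stats response.
sample = {
    "nodes": {
        "abc123": {
            "name": "data-node-1",
            "jvm": {"mem": {"heap_used_percent": 91}},
            "thread_pool": {"write": {"rejected": 42}},
        },
        "def456": {
            "name": "data-node-2",
            "jvm": {"mem": {"heap_used_percent": 60}},
            "thread_pool": {"write": {"rejected": 7}},
        },
    }
}

alerts = capacity_alerts(sample, previous_rejections={"abc123": 40, "def456": 7})
for alert in alerts:
    print(alert)
```

Alerting on growth in write rejections, rather than on heap usage alone, would surface rejected index requests on the day they start instead of after an escalation.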
Posted Aug 28, 2025 - 15:46 CEST

Resolved

The incident has been fully resolved, and all services are operating normally. We sincerely apologize for the disruption and any impact this may have had on your operations. We appreciate your patience during this time.
Posted Aug 28, 2025 - 08:52 CEST

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Aug 27, 2025 - 18:08 CEST

Identified

The issue has been identified and a fix is being implemented.
Posted Aug 27, 2025 - 11:06 CEST

Investigating

We are currently experiencing issues with new Mim synchronizations to data sources. While all previously synchronized data remains accessible and functional, new sync operations are failing to complete successfully.

IMPACT:
  • New Mim sync operations created since August 22 are failing
  • Existing synchronized data continues to work normally
  • Automatic syncs are not updating with new content
  • Manual re-sync operations may disrupt existing working connections

STATUS:
Our engineering team is actively investigating the root cause of this issue. We strongly advise against initiating manual re-syncs at this time, as this may interrupt existing functional data connections.
Posted Aug 26, 2025 - 16:30 CEST
This incident affected: AI Features (Mim AI agent).