Intermittent telephony issues

Incident Report for Dixa

Postmortem

Summary

On November 5, 2018, multiple parts of our primary carrier’s infrastructure intermittently failed to reach our platform, or did so with significant latency, because of a callback pipeline capacity issue in the carrier’s HTTP proxying and caching stack. The issue was caused by a combination of seasonally high load and an excessive number of callback connections created by redirects and retries. It occurred intermittently between 14:47 UTC and 18:04 UTC, totaling 109 minutes of degraded service. The telephony part of our platform, which depends on HTTP callbacks from the carrier to function correctly, suffered a partial outage as a result.

Timeline

14:47 UTC: The carrier’s monitoring systems alert on an increased number of HTTP callback failures and increased callback latency across their infrastructure.
15:09 UTC: The issue subsides; the carrier indicates that it is resolved.
15:52 UTC: HTTP callback failures and latency increase again; the carrier reopens the incident.
16:16 UTC: The carrier’s engineering team increases callback pipeline capacity.
17:35 UTC: The carrier’s engineering team identifies an unusually high number of HTTP 301 redirect responses to its HTTP status callbacks and begins investigating this as a contributing factor.
17:41 UTC: The carrier makes a temporary change to limit the number of HTTP redirects processed through the system; failure rates return to normal levels.
19:27 UTC: The carrier deploys an additional change to mitigate the issue, further limiting the maximum number of redirects allowed.
20:05 UTC: All operating indicators have remained healthy, with no recurrence of the issue for two hours; the carrier marks the incident as resolved.

Root cause

All of the carrier’s infrastructure uses a shared HTTP callback pipeline to process status callback and webhook requests. During the incident, an increase in the number of HTTP callbacks and excessive HTTP connections created to serve redirects and retries consumed the available TCP/IP ports in the callback pipeline's caching layer and caused the outage.

Due to seasonal demand, the total number of HTTP callbacks handled by this system increased. In addition, the number of HTTP 301 redirect responses from customer servers increased significantly. These redirect responses in particular, combined with the higher overall load, resulted in an exponential increase in the number of TCP/IP ports required by the carrier’s callback pipeline. The increase eventually exhausted all ports available to the HTTP callback pipeline, which meant the carrier was unable to make new status callback or webhook requests to our servers, resulting in the outage.
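As a rough illustration of why redirects and retries inflate port usage, the sketch below applies Little's law: connections in flight equal the arrival rate multiplied by the time each connection is held. The function name, its parameters, and every number are hypothetical and do not come from the carrier's report. The model also assumes that each redirect hop and each retry opens a new connection, and it grows only linearly per hop, so it does not reproduce any compounding effects the carrier observed.

```python
def estimated_ports_in_use(callbacks_per_second, redirects_per_callback,
                           retries_per_callback, seconds_per_connection):
    """Estimate concurrent outbound connections, each holding one local port.

    Hypothetical model: one connection for the original request, plus one
    per redirect hop followed, plus one per retry attempt.
    """
    connections_per_callback = 1 + redirects_per_callback + retries_per_callback
    # Little's law: items in flight = arrival rate * time in system.
    return callbacks_per_second * connections_per_callback * seconds_per_connection


# Illustrative numbers only (assumptions, not measurements).
baseline = estimated_ports_in_use(1000, 0, 0, 2)
with_redirects = estimated_ports_in_use(1000, 1, 1, 2)
print(baseline, with_redirects)  # prints "2000 6000"
```

Under these assumptions, a single redirect and a single retry per callback triple the number of ports held at any moment, before any growth in callback volume is taken into account.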

Resolution plan

Immediately following the incident, the carrier increased the capacity of their HTTP proxy and caching systems. They also deployed a change to reduce the maximum number of HTTP 301 redirects that part of their systems allows, which will significantly reduce the network load on these systems. The carrier has also set more aggressive alerting thresholds for callback pipeline capacity issues and for increased latency in HTTP callback responses, giving their engineering team earlier visibility into any recurrence of these problems. Their on-call team will continue operating under a heightened-awareness protocol for the duration of the seasonal high-volume period.
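The redirect limit described above can be sketched as follows. This is a minimal illustration and not the carrier's implementation: the names `deliver_callback`, `TooManyRedirects`, and `send`, the default limit of one redirect, and the set of status codes treated as redirects are all assumptions. The `send` argument is a hypothetical callable that performs one HTTP request and returns a `(status, location)` pair, which keeps the sketch independent of any particular HTTP library.

```python
REDIRECT_STATUSES = {301, 302, 303, 307, 308}


class TooManyRedirects(Exception):
    """Raised when a callback target redirects more often than allowed."""


def deliver_callback(url, send, max_redirects=1):
    """Deliver one callback, following at most `max_redirects` redirects.

    `send(url)` is a hypothetical transport function returning
    (status_code, location_or_None). Returns (final_status, final_url).
    """
    hops = 0
    while True:
        status, location = send(url)
        if status not in REDIRECT_STATUSES or location is None:
            return status, url
        if hops >= max_redirects:
            # Stop here instead of opening yet another connection.
            raise TooManyRedirects(f"gave up on {url} after {hops} redirect(s)")
        hops += 1
        url = location


# Usage with a fake transport; the hosts below are placeholders.
routes = {
    "http://callbacks.example.invalid/status": (301, "https://callbacks.example.invalid/status"),
    "https://callbacks.example.invalid/status": (200, None),
}
print(deliver_callback("http://callbacks.example.invalid/status", routes.__getitem__))
```

On the receiving side, registering the final callback URL directly (correct scheme, host, and trailing slash) avoids the redirect altogether, so each callback needs only one connection.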

In the long term, the carrier will make their HTTP proxy and caching systems more resilient. They will improve the callback pipeline so that it scales automatically with increased callback load. They will modify the system so that one-way status callbacks are routed separately from webhook requests that require a response, thereby eliminating the need for a common caching layer. They will also review the way they handle HTTP redirects system-wide and reduce the number of allowed redirects wherever possible.

Posted Nov 06, 2018 - 13:08 CET

Resolved

After monitoring the platform for multiple hours, we are confident that the instabilities have been resolved. We will publish a post-mortem as soon as we have access to all the details.

Posted Nov 05, 2018 - 21:07 CET

Monitoring

Our carrier has identified the remaining issue and deployed a fix. We will keep monitoring the situation.

Posted Nov 05, 2018 - 19:16 CET

Identified

Our carrier reports that the issue has not been fully resolved. They have identified the issue and are currently working on fixing it.

Posted Nov 05, 2018 - 18:32 CET

Monitoring

Our carrier has confirmed that the issue has been resolved, and they will continue to monitor the situation throughout the next 24 hours. We have requested an RFO and will update this page as soon as we have more information. We again apologise for the inconvenience that this has caused.

Posted Nov 05, 2018 - 18:19 CET

Update

Inbound and outbound services are now less affected by the outage. Some calls might still fail, but it would appear the majority are now going through normally. We are waiting for confirmation from our carrier.

Posted Nov 05, 2018 - 18:09 CET

Identified

One of our carriers is currently experiencing elevated error levels. This causes calls to and from our platform to end with the error message "An application error has occurred". We are currently working with the carrier to resolve the issue.

Posted Nov 05, 2018 - 17:35 CET

This incident affected: Telephony & SMS (Inbound, Outbound, WebRTC).