On November 5, 2018, multiple parts of our primary carrier intermittently failed to reach our platform, or experienced significant latency when doing so, due to a capacity issue in the callback pipeline of the carrier’s HTTP proxying and caching stack. The issue was caused by a combination of seasonally high load and excessive callback connections created by redirects and retries. It occurred intermittently between 14:47 UTC and 18:04 UTC, totaling 109 minutes of degraded service. The telephony part of our platform, which depends on HTTP callbacks from the carrier to function correctly, suffered a partial outage as a result.
All of the carrier’s infrastructure uses a shared HTTP callback pipeline to process status callback and webhook requests. During the incident, an increase in the number of HTTP callbacks, combined with excessive HTTP connections created to serve redirects and retries, consumed the available TCP/IP ports in the callback pipeline's caching layer and caused the outage.
Due to seasonal demand, the total number of HTTP callbacks handled by this system increased. Additionally, the number of HTTP 301 redirect responses from customer servers increased significantly. These redirect responses in particular, along with the increased overall load, resulted in an exponential increase in the number of TCP/IP ports required by the carrier’s callback pipeline. The increase eventually exhausted all ports available to the HTTP callback pipeline, which meant the carrier was unable to make new status callback or webhook requests to our servers, resulting in the outage.
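To illustrate why redirects and retries weigh so heavily on port usage, the following is a rough back-of-the-envelope sketch, not the carrier's actual model. It assumes that each redirect hop followed opens one additional outbound connection, that each retry repeats the full redirect chain, and that every connection holds a port for a fixed time. The function names and all numbers are hypothetical.

```python
def connections_per_callback(redirect_hops: int, retries: int) -> int:
    """Worst-case outbound connections opened for a single callback.

    Assumption: one connection for the original request plus one per
    redirect hop, and every retry repeats the whole chain.
    """
    attempts = 1 + retries
    return attempts * (1 + redirect_hops)


def ports_in_use(callbacks_per_second: float, redirect_hops: int,
                 retries: int, hold_seconds: float) -> float:
    """Steady-state number of ports held at once.

    Uses Little's law: items in the system = arrival rate x time held.
    hold_seconds is the assumed time a port stays occupied per connection.
    """
    per_callback = connections_per_callback(redirect_hops, retries)
    return callbacks_per_second * per_callback * hold_seconds


# Hypothetical figures, chosen only to show the effect.
baseline = ports_in_use(1000, redirect_hops=0, retries=0, hold_seconds=5.0)
degraded = ports_in_use(1000, redirect_hops=3, retries=2, hold_seconds=5.0)
print(baseline, degraded)
```

In this simplified model, port demand is the product of callback rate, redirect hops, and retries, so a modest rise in each can push the total past the number of ports a single source address can offer.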
Immediately following the incident, the carrier increased the capacity of their HTTP proxy and caching systems. They also deployed a change to reduce the maximum number of HTTP 301 redirects that this part of their systems allows, which will significantly reduce the network load on these systems. The carrier has also set more aggressive alerting thresholds for callback pipeline capacity issues and for increased latency in HTTP callback responses, giving their engineering team advance visibility into any recurrence of these problems. Their on-call team will continue operating under a heightened-awareness protocol for the duration of the seasonal high-volume period.
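A redirect cap of the kind described above could look roughly like the following. This is a minimal sketch, not the carrier's implementation: `MAX_REDIRECTS`, `deliver_callback`, `TooManyRedirects`, and the injected `send` function are all hypothetical names, and the limit of 2 is an arbitrary example value.

```python
from typing import Callable, Optional, Tuple

# Hypothetical limit; the carrier's actual value is not stated.
MAX_REDIRECTS = 2

REDIRECT_STATUSES = {301, 302, 307, 308}

# A response is modeled as (status code, Location header or None).
Response = Tuple[int, Optional[str]]


class TooManyRedirects(Exception):
    """Raised when a callback target redirects more often than allowed."""


def deliver_callback(url: str, send: Callable[[str], Response],
                     max_redirects: int = MAX_REDIRECTS) -> Tuple[int, int]:
    """Deliver one callback, following at most max_redirects redirects.

    `send` performs a single HTTP request and is injected so the sketch
    stays independent of any particular HTTP client.
    Returns (final status code, number of redirects followed).
    """
    hops = 0
    while True:
        status, location = send(url)
        if status not in REDIRECT_STATUSES or location is None:
            return status, hops
        if hops >= max_redirects:
            # Stop here instead of opening yet another connection.
            raise TooManyRedirects(f"gave up after {hops} redirects")
        url = location
        hops += 1
```

With a cap in place, a misconfigured customer server that redirects indefinitely costs at most `max_redirects + 1` connections per delivery attempt rather than an unbounded number.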
Long-term, the carrier will make their HTTP proxy and caching systems more resilient. They will improve the callback pipeline so that it scales automatically with increased callback load. They will modify the system so that one-way status callbacks are routed separately from webhook requests that require a response, thereby eliminating the need for a common caching layer. They will also review the way they handle HTTP redirects system-wide and reduce the number of allowed redirects wherever possible.