Inbound emails are delayed
Incident Report for Dixa
Postmortem

Summary

Starting on February 4, 2019, our main email provider experienced an incident that took down our ability to process inbound emails. One of the nodes in their PostgreSQL cluster failed, which caused the database to switch into read-only mode. The last email received by our platform arrived at 04:21 UTC, and normal email operation resumed on February 6, 2019 at around 00:45 UTC.

Timeline

  • 2019-02-04 04:21 UTC: The last webhook from the email provider is received by the Dixa platform,

  • 2019-02-04 06:58 UTC: The Dixa Engineering team starts investigating the delay in email deliveries,

  • 2019-02-04 08:48 UTC: We open a critical support ticket with the email provider,

  • 2019-02-04 08:53 UTC: We escalate the ticket to the email provider’s executives (no acknowledgement),

  • 2019-02-04 09:24 UTC: We escalate the ticket to the email provider’s parent company (no acknowledgement),

  • 2019-02-04 10:09 UTC: The Dixa Engineering team starts developing an alternative system to the current email provider,

  • 2019-02-04 15:55 UTC: The Dixa Engineering team starts performing the first integration tests with the alternative system,

  • 2019-02-04 15:56 UTC: The email provider acknowledges the support ticket,

  • 2019-02-05 11:56 UTC: The Dixa Engineering team finishes deploying the infrastructure required to replace the email provider. The team continues to develop the new software components,

  • 2019-02-05 13:15 UTC: The Dixa platform receives a single large delivery of emails from the email provider, totaling a few hundred emails,

  • 2019-02-05 14:12 UTC: The Dixa platform receives another single large delivery of emails from the email provider, totaling a few hundred emails,

  • 2019-02-05 16:07 UTC: The email provider sends an update with a technical description of the problem. The email indicates that the problem will take multiple days to be resolved,

  • 2019-02-05 21:40 UTC: The Dixa platform receives multiple large deliveries of emails from the email provider, totaling a couple thousand emails,

  • 2019-02-05 22:09 UTC: The email provider announces that the issue has been resolved,

  • 2019-02-05 23:30 UTC: Emails start being delivered to the Dixa platform at a rate of a few hundred emails per minute,

  • 2019-02-06 00:45 UTC: The backlog of emails has been delivered by the email provider. New emails are coming in normally.

Root cause

Our email provider uses a sharded PostgreSQL setup as one of its main data stores. On Sunday, February 4, at 04:30 UTC, one of the five physical PostgreSQL instances saw a significant spike in writes. This spike in writes triggered a Transaction ID wraparound issue. When this occurs, database activity is completely halted. The database put itself into read-only mode until offline maintenance (known as vacuuming) could be performed. Because the database is large, running the vacuum process was expected to take a significant amount of time and resources.
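
To make the wraparound mechanism concrete, here is a minimal sketch of how the remaining headroom could be estimated. It assumes the age of the oldest unfrozen transaction ID has already been read from PostgreSQL (for example with `SELECT max(age(datfrozenxid)) FROM pg_database`). The function names and the two safety margins are illustrative assumptions, not values taken from the provider's setup; the real margins depend on the PostgreSQL version.

```python
# PostgreSQL transaction IDs are 32-bit counters. A row's XID may lag the
# current XID by at most about 2**31 transactions before it has to be
# frozen by VACUUM; past that point the database protects itself.
XID_HORIZON = 2**31


def wraparound_headroom(xid_age: int) -> int:
    """Transactions left before the oldest unfrozen XID reaches the horizon."""
    return max(XID_HORIZON - xid_age, 0)


def classify(
    xid_age: int,
    warn_margin: int = 40_000_000,  # assumed margin, version dependent
    stop_margin: int = 3_000_000,   # assumed margin, version dependent
) -> str:
    """Map an XID age to a coarse, hypothetical health state."""
    headroom = wraparound_headroom(xid_age)
    if headroom <= stop_margin:
        return "refusing-writes"
    if headroom <= warn_margin:
        return "warning"
    return "ok"


if __name__ == "__main__":
    for age in (200_000_000, 2**31 - 10_000_000, 2**31 - 1_000_000):
        print(age, wraparound_headroom(age), classify(age))
```

A sudden spike in writes consumes transaction IDs faster than routine vacuuming can freeze old rows, which is how a database that looked healthy can lose its headroom quickly.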

In addition, during the recovery efforts, the email provider’s team noticed that some data was blocking the recovery. They decided to delete that data in order to bring the whole system back online. This means that a portion of the emails that your customers sent you during the outage might be lost.

Dixa chose said email provider many years ago for its reliable track record, and because the parent company is a trusted industry leader. In the past few years of using this email provider, Dixa has never witnessed any outage this critical.

Because email systems rely on DNS, MX records and trust between machines, it is not industry standard to have multiple fallback email providers that one can swap out on the fly. Email is designed to be resilient to server downtime (if an email cannot be delivered, it will be retried later, over the course of multiple days). However, in this case, the email provider kept accepting incoming emails, with no ability to store them. From the sending server's point of view these emails had been delivered, so they were never retried.
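
The retry behaviour described above follows from how a sending mail server reacts to SMTP reply codes. The sketch below is a simplified illustration, not any provider's actual logic; the function name `sender_action` is hypothetical, and real mail servers apply more detailed rules.

```python
def sender_action(reply_code: int) -> str:
    """Simplified view of what a sending mail server does with a reply code."""
    if 200 <= reply_code < 300:
        # The receiver accepted responsibility for the message:
        # the sender drops it from its queue and never retries.
        return "delivered"
    if 400 <= reply_code < 500:
        # Temporary failure: the message stays queued and is retried later.
        return "retry"
    if 500 <= reply_code < 600:
        # Permanent failure: the message is bounced back to its author.
        return "bounce"
    raise ValueError(f"unexpected SMTP reply code: {reply_code}")


if __name__ == "__main__":
    for code in (250, 451, 550):
        print(code, sender_action(code))
```

A receiver that is down or answers with a temporary failure lets the sender's queue absorb the outage. A receiver that accepts the message and then cannot store it removes that safety net.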

Resolution plan

As indicated in the timeline above, the Dixa Engineering team started working on an alternative system within hours of the outage. Our team was able to make rapid progress on this unexpected work because it was originally planned for 2019Q3, and a significant amount of research had already been done. As of writing, the new system is not yet production ready, but we hope to deploy it in trial mode in the coming days. After studying the problem, we have decided to run our email system in a "dual-stack" mode, with the ability to quickly fall back to a secondary email provider in case the primary one fails.
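
As a rough illustration of what a "dual-stack" fallback decision could look like, the sketch below picks the active provider based on how long the primary has been silent. It does not describe the system under development: the `Provider` type, the `pick_active` function and the 30-minute threshold are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Provider:
    name: str
    last_inbound: datetime  # time of the last webhook received from it


def pick_active(
    primary: Provider,
    secondary: Provider,
    now: datetime,
    max_silence: timedelta = timedelta(minutes=30),  # assumed threshold
) -> Provider:
    """Stay on the primary unless it has been silent for too long."""
    if now - primary.last_inbound <= max_silence:
        return primary
    return secondary


if __name__ == "__main__":
    utc = timezone.utc
    primary = Provider("primary", datetime(2019, 2, 4, 4, 21, tzinfo=utc))
    secondary = Provider("secondary", datetime(2019, 2, 4, 6, 50, tzinfo=utc))
    now = datetime(2019, 2, 4, 6, 58, tzinfo=utc)
    print(pick_active(primary, secondary, now).name)
```

Silence alone is a crude signal, since inbound traffic can be legitimately low at night, so a real implementation would likely combine it with other health checks before switching providers.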

Posted 14 days ago. Feb 07, 2019 - 14:22 CET

Resolved
The underlying issue has been resolved by our supplier. We confirm that emails are arriving as expected without delay, and we expect normal operations going forward. We will keep monitoring the situation closely.
Posted 15 days ago. Feb 06, 2019 - 06:21 CET
Monitoring
We have observed regular delivery of incoming emails for the past few hours, which is a very positive development.
All incoming emails delivered to Dixa are being processed correctly.
We expect the situation to stabilize further within the next few hours.

We will follow up at 7 AM tomorrow morning.
Posted 15 days ago. Feb 06, 2019 - 01:59 CET
Update
We have received a number of email batches from the incumbent email provider, totaling a few hundred emails. These batches appear to be efforts from the provider to resolve the issue, but we still have not received any ETA regarding a permanent fix. We are continuing our efforts to remove the use of said carrier from our infrastructure, in order to get the inbound email portion of our service back up and running.
Regardless of whether the incumbent email provider fixes the issue or not, we will be moving away from them as soon as possible. We will provide another status update in the next few hours.
Posted 16 days ago. Feb 05, 2019 - 15:57 CET
Update
Our engineering team has identified and selected a change to our infrastructure that will remove the affected carrier from our stack. The team is currently implementing the change and has started the integration testing. This is a significant amount of work that was not intended to happen before 2019Q3. We will update this page again at 16:00 CET with further information and a progress update.
Posted 16 days ago. Feb 05, 2019 - 14:04 CET
Update
In parallel to working closely with our email provider, we are currently investigating every technically feasible alternative to resolve this issue. Considering that we have not received an ETA to resolution from our current email provider, we have engaged the entire engineering team to implement a workaround that will re-enable the inbound email channel. We will provide another update at 2PM CET.
Posted 16 days ago. Feb 05, 2019 - 12:09 CET
Update
Mandrill, our email channel integration partner, is experiencing a global outage affecting 3000 customers. We are doing all we can, but no workaround is currently possible until the underlying issue is fixed by Mandrill.

https://mobile.twitter.com/mandrillapp

We will keep you posted during the day until we have a solution.
To get in touch with Dixa Support, please give us a call or use the chat widget in the product.
Posted 16 days ago. Feb 05, 2019 - 09:02 CET
Update
Unfortunately, the issue with delayed inbound emails at our third-party provider still persists. We will update here immediately when there are new developments.
Posted 16 days ago. Feb 05, 2019 - 06:50 CET
Update
We are continuing to work with our email partner to resolve the issue. All of our engineering efforts are focused on this. We apologise for the disruption this is causing.
Posted 17 days ago. Feb 04, 2019 - 15:22 CET
Identified
We have confirmation that the delay is caused by a third-party supplier. We are working with their support team to resolve the issue as fast as possible.
Posted 17 days ago. Feb 04, 2019 - 10:06 CET
Investigating
We are investigating a delay in the processing of inbound emails. New emails currently do not show up in Dixa.
Posted 17 days ago. Feb 04, 2019 - 09:39 CET
This incident affected: Email (Inbound).