Starting on February 4, 2019, our main email provider experienced an incident that took down our ability to process inbound emails. One of the nodes in their PostgreSQL cluster failed, which caused the database to switch into read-only mode. Our platform received its last email at 04:21 UTC, and normal email operation resumed on February 6, 2019 at around 00:45 UTC.
2019-02-04 04:21 UTC: The last webhook from the email provider is received by the Dixa platform.
2019-02-04 06:58 UTC: The Dixa Engineering team starts investigating the delay in email deliveries.
2019-02-04 08:48 UTC: We open a critical support ticket with the email provider.
2019-02-04 08:53 UTC: We escalate the ticket to the email provider’s executives (no acknowledgement).
2019-02-04 09:24 UTC: We escalate the ticket to the email provider’s parent company (no acknowledgement).
2019-02-04 10:09 UTC: The Dixa Engineering team starts developing an alternative system to the current email provider.
2019-02-04 15:55 UTC: The Dixa Engineering team starts performing the first integration tests with the alternative system.
2019-02-04 15:56 UTC: The email provider acknowledges the support ticket.
2019-02-05 11:56 UTC: The Dixa Engineering team finishes deploying the infrastructure required to replace the email provider. The team continues to develop the new software components.
2019-02-05 13:15 UTC: The Dixa platform receives a single large delivery from the email provider, totaling a few hundred emails.
2019-02-05 14:12 UTC: The Dixa platform receives another single large delivery from the email provider, totaling a few hundred emails.
2019-02-05 16:07 UTC: The email provider sends an update with a technical description of the problem. The update indicates that the problem will take multiple days to resolve.
2019-02-05 21:40 UTC: The Dixa platform receives multiple large deliveries from the email provider, totaling a couple thousand emails.
2019-02-05 22:09 UTC: The email provider announces that the issue has been resolved.
2019-02-05 23:30 UTC: Emails start arriving on the Dixa platform at a rate of a few hundred per minute.
2019-02-06 00:45 UTC: The backlog of emails has been delivered by the email provider. New emails are coming in normally.
Our email provider uses a sharded PostgreSQL setup as one of its main data stores. On Sunday, February 4, at 04:30 UTC, one of the five physical PostgreSQL instances saw a significant spike in writes. This spike triggered a Transaction ID wraparound issue. PostgreSQL numbers transactions with a 32-bit counter, and it must stop accepting writes before that counter wraps around in order to protect existing data. When this occurs, database activity is completely halted: the database put itself into read-only mode until offline maintenance (known as vacuuming) could be performed. Because the database is large, running the vacuum process was expected to take a significant amount of time and resources.
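To give a sense of how a write spike can exhaust the transaction ID space, here is a minimal sketch that estimates the remaining headroom. It is not the email provider's tooling. The function names, the 1,000,000-ID safety margin and the example rates are assumptions made for illustration. The only PostgreSQL facts it relies on are the 32-bit transaction counter and the age of the oldest unfrozen transaction ID, which PostgreSQL reports through `age(datfrozenxid)` in `pg_database`.

```python
# Illustrative only: thresholds are assumptions, not the provider's settings.

# With a 32-bit counter, roughly 2^31 transaction IDs can be "in the past"
# before old rows would start to look like they come from the future.
XID_WRAPAROUND_HORIZON = 2**31


def xids_remaining(oldest_xid_age: int, safety_margin: int = 1_000_000) -> int:
    """Transaction IDs left before the database would stop accepting writes.

    oldest_xid_age: age of the oldest unfrozen transaction ID, for example
        the value returned by age(datfrozenxid) for a database.
    safety_margin: assumed number of IDs the server keeps in reserve.
    """
    return max(0, XID_WRAPAROUND_HORIZON - safety_margin - oldest_xid_age)


def hours_until_read_only(
    oldest_xid_age: int,
    xids_per_second: float,
    safety_margin: int = 1_000_000,
) -> float:
    """Rough time left at a constant rate of ID-consuming (write) transactions."""
    if xids_per_second <= 0:
        raise ValueError("xids_per_second must be positive")
    remaining = xids_remaining(oldest_xid_age, safety_margin)
    return remaining / xids_per_second / 3600


if __name__ == "__main__":
    # Hypothetical numbers: a database 2.1 billion IDs "old" under a write spike.
    age = 2_100_000_000
    for rate in (500, 5_000):
        hours = hours_until_read_only(age, rate)
        print(f"{rate} writes/s -> about {hours:.1f} hours of headroom")
```

The point of the sketch is that the headroom shrinks linearly with the write rate, so a tenfold spike in writes cuts the remaining time by a factor of ten. Vacuuming lowers the age of the oldest unfrozen ID and restores the headroom.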
In addition, during the recovery, the email provider’s team found that some data was blocking their progress. They decided to delete that data in order to bring the whole system back online. This means that a portion of the emails your customers sent you during the outage may have been lost.
Dixa chose this email provider many years ago for its reliable track record, and because its parent company is a trusted industry leader. In all the years we have used this email provider, Dixa has never witnessed an outage this critical.
Because email systems rely on DNS, MX records and trust between machines, it is not industry standard to have multiple fallback email providers that one can swap out on the fly. Email is designed to be resilient to server downtime: if an email cannot be delivered, the sending server retries later, over the course of multiple days. However, in this case, the email provider kept accepting incoming emails with no ability to store them, so the sending servers considered them delivered and did not retry.
As indicated in the timeline above, the Dixa Engineering team very quickly started working on an alternative system. Our team was able to progress so quickly on this unexpected work because it was originally planned for 2019Q3, and a significant amount of research had already been done. At the time of writing, the new system is not yet production ready, but we hope to deploy it in trial mode in the coming days. After studying the problem, we have decided to run our email system in a "dual-stack" mode, with the ability to quickly fall back to a secondary email provider in case the primary one fails.
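As a rough illustration of what a fallback decision in a "dual-stack" setup could look like, the sketch below switches to the secondary provider once the primary has been silent for too long. This is not the system under development. The function name, the provider labels and the 30-minute silence threshold are all hypothetical, and a real deployment would need additional signals, since quiet periods with no inbound email are normal.

```python
from datetime import datetime, timedelta, timezone


def choose_provider(
    last_primary_delivery: datetime,
    now: datetime,
    max_silence: timedelta = timedelta(minutes=30),  # assumed threshold
) -> str:
    """Return which provider should handle inbound email.

    last_primary_delivery: time of the most recent webhook received from
        the primary provider.
    now: current time, passed in explicitly so the function is easy to test.
    """
    silence = now - last_primary_delivery
    return "primary" if silence <= max_silence else "secondary"


if __name__ == "__main__":
    # Times taken from the timeline above: last webhook at 04:21 UTC,
    # investigation started at 06:58 UTC.
    last_webhook = datetime(2019, 2, 4, 4, 21, tzinfo=timezone.utc)
    investigation = datetime(2019, 2, 4, 6, 58, tzinfo=timezone.utc)
    print(choose_provider(last_webhook, investigation))
```

With the assumed threshold, a check of this kind would have flagged the primary provider well before the investigation began at 06:58 UTC.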