Post Mortem - Outbound SMTP Service Degradation on Oct 24
On October 24th, between 15:32 and 17:25 GMT, there was an outage in our Outbound Filtering service. During the outage, our SMTP service did not accept connections from upstream servers, which caused mail queues to grow on customers' servers. To our knowledge, no email messages were lost. We have now identified the root cause of these delays.
What happened?
A configuration error was introduced into our policy system during a routine deployment of new software. The error degraded a core policy system, which caused the service to accept new SMTP connections from customers’ servers very slowly. Our monitoring systems notified us of an issue at 15:32 GMT, within one minute of the error being introduced. We quickly identified the root cause and started rolling back the deployment that caused the issue. By 17:25 GMT, all services were fully restored.
We are looking into how a failure like this can be avoided in the future and how we can recover faster when a new deployment causes problems.