Post-Mortem: Service Degradation on 2023-10-29
A post-mortem for the Ory Network email delivery service degradation on 2023-10-29.
Incident Report: Email Delivery Service Degradation
Dear Customers and Users,
We’d like to share a post-mortem of the incident on October 29th. We are sincerely sorry for the inconvenience it caused, and we are sharing this write-up to describe the actions we have taken to prevent such an incident in the future.
Incident Summary
Incident Start: 2023-10-29 21:05 CET
Incident End: 2023-10-30 15:54 CET
Services Affected: Ory Network Login & Identities APIs. The processing of the Ory Network email queue was significantly delayed. This impacted all Ory Network projects, regardless of whether they used a custom email delivery service or Ory’s preconfigured one.
Root Cause Analysis
The root cause was a combination of three factors: partial blocking by a third-party email delivery service, incorrect error handling, and elevated traffic that caused a large number of emails to be sent.
Resolution and Recovery
The messages that were blocked were removed from the queue manually. The customer whose configuration was causing blocked messages was contacted, and their email delivery was disabled. The worker then started to process all outstanding messages, and the message queue was back to normal levels around 2023-10-30 15:54 CET (with a constant influx of new messages).
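To illustrate the kind of error handling that lets a worker set blocked messages aside by itself, here is a minimal sketch. It is not Ory's implementation: the exception classes, the `Message` type, the `drain` function, and the `blocked.example` domain are all hypothetical names chosen for this example. It assumes the delivery call can tell a permanent rejection apart from a temporary failure.

```python
from collections import deque
from dataclasses import dataclass


class PermanentDeliveryError(Exception):
    """Hypothetical: the provider rejected the message; a retry will not help."""


class TransientDeliveryError(Exception):
    """Hypothetical: a temporary failure; a later retry may succeed."""


@dataclass
class Message:
    id: str
    recipient: str
    attempts: int = 0


def drain(queue, send, max_attempts=3):
    """Process the queue until it is empty and return the parked messages."""
    parked = []
    while queue:
        msg = queue.popleft()
        try:
            send(msg)
        except PermanentDeliveryError:
            # Set the message aside so it cannot hold up later messages.
            parked.append(msg)
        except TransientDeliveryError:
            msg.attempts += 1
            if msg.attempts < max_attempts:
                queue.append(msg)  # retry later, behind the other messages
            else:
                parked.append(msg)
    return parked


sent = []


def fake_send(msg):
    # Stand-in for a real delivery call; rejects one made-up domain.
    if msg.recipient.endswith("@blocked.example"):
        raise PermanentDeliveryError(msg.id)
    sent.append(msg.id)


queue = deque([
    Message("1", "a@blocked.example"),
    Message("2", "b@example.com"),
    Message("3", "c@example.com"),
])
parked = drain(queue, fake_send)
print(sent, [m.id for m in parked])
```

In this sketch the rejected message ends up in `parked` while the two messages behind it are still delivered. With SMTP, the split between the two error classes could follow the reply codes, where 4yz replies are transient and 5yz replies are permanent.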
Preventive Measures
- Improve and fix error handling (implemented)
- Improve email worker efficiency (implemented)
- Improve tests so that changes in the error handling of the third-party email library cause test failures (in development)
- Improve monitoring of and alerting on message queue pressure (in development)
- Improve parallel processing of messages to reduce message latency (in development)
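As an illustration of what monitoring message queue pressure could look like, the following sketch checks two signals: the number of waiting messages and the age of the oldest one. It is an assumption-laden example, not a description of Ory's monitoring; `QueueStats`, `pressure_alerts`, and both threshold values are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class QueueStats:
    depth: int                              # messages waiting to be sent
    oldest_enqueued_at: Optional[datetime]  # None when the queue is empty


def pressure_alerts(stats, now, max_depth=10_000, max_age=timedelta(minutes=5)):
    """Return one alert string per exceeded threshold (thresholds are made up)."""
    alerts = []
    if stats.depth > max_depth:
        alerts.append(f"queue depth {stats.depth} exceeds {max_depth}")
    if stats.oldest_enqueued_at is not None:
        age = now - stats.oldest_enqueued_at
        if age > max_age:
            alerts.append(f"oldest message has waited {age}, limit is {max_age}")
    return alerts


# Example: a short queue whose oldest message has been stuck for 30 minutes.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
stats = QueueStats(depth=50, oldest_enqueued_at=now - timedelta(minutes=30))
alerts = pressure_alerts(stats, now)
print(alerts)
```

Tracking the age of the oldest message alongside the depth is a deliberate choice in this sketch: a queue that is stalled behind a few messages can stay short for a while, so a depth threshold alone may not fire.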