Post-Mortem: Outage on 2022-11-21

Patrik Neu - Last updated on November 24, 2022

Summary

Ory has ambitious goals for Ory Network, and we invest heavily in the reliability of our services. However, on 2022/11/21 from 10:13 to 10:21, we suffered an outage of our identity, permission, and federation APIs that affected most users. The outage affected all API calls and operations that serve our customers' end users.

The issue was caused by a configuration change that Ory applied to a customer project. The incorrect configuration sent all instances of one of our services into a crash loop, in which they either responded with an error code or did not respond at all.

Timeline

[2022/11/21 10:10]
A manual update to a customer's subscription plan is triggered.

[2022/11/21 10:12]
Our monitoring detects the problem.

[2022/11/21 10:15]
Our monitoring triggers an on-call alarm, and the investigation starts.

[2022/11/21 10:15]
The service instances causing the problem are identified and the log analysis starts.

[2022/11/21 10:19]
The problem is identified, and the mitigation is prepared.

[2022/11/21 10:21]
The update is reverted, and the service instances stabilize. Monitoring reports all services running without problems again.

Remediations

The underlying problem was that custom subscription plans currently require manual updates by our operations team. Whenever a human is involved, there is a risk of typos and copy-paste errors. We are improving subscription management so that no manual configuration is required. Furthermore, we enabled a recovery middleware that makes the services more resilient: instead of crashing, they respond with a healthy default, which keeps the system stable. With these changes in place, a similar faulty configuration would not impact the stability of our services.
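
To illustrate the general idea behind a recovery middleware, here is a minimal sketch in Go. It is not Ory's actual implementation: the names Recover, broken, fallback, and statusFor are hypothetical, and the choice of a 503 response as the "default" is an assumption made only for this example. The wrapper catches a panic raised while serving a request, logs it, and hands the request to a fallback handler so the client receives a controlled response.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"net/http/httptest"
)

// Recover wraps next so that a panic raised while serving a request is
// caught, logged, and answered by fallback instead of aborting the request.
// Hypothetical helper for illustration only.
func Recover(next, fallback http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		defer func() {
			if err := recover(); err != nil {
				log.Printf("recovered from panic on %s %s: %v", r.Method, r.URL.Path, err)
				// Assumes next has not started writing the response yet.
				fallback.ServeHTTP(w, r)
			}
		}()
		next.ServeHTTP(w, r)
	})
}

// broken stands in for a handler that panics on a faulty configuration.
var broken = http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
	panic("invalid project configuration")
})

// fallback is the assumed default response: a controlled 503.
var fallback = http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
	http.Error(w, "service temporarily unavailable", http.StatusServiceUnavailable)
})

// statusFor serves one in-memory GET request and returns the status code.
func statusFor(h http.Handler) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/", nil))
	return rec.Code
}

func main() {
	fmt.Println(statusFor(Recover(broken, fallback))) // prints 503
}
```

In a real service, the wrapper would be applied once around the router, for example http.ListenAndServe(addr, Recover(router, fallback)), so every route gets the same protection. A middleware like this only covers panics raised during request handling; failures at process startup need separate safeguards, such as validating configuration before it is applied.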