Post-Mortem: Ory Network Outage

Adam Wałach

Time to read: 2 min

Incident Report: Ory Network Outage

Dear Customers and Users,

We’d like to share a post-mortem about the partial outage of Ory Network services on 2024-02-28. We are sincerely sorry for the inconvenience caused by the incident, and we are sharing this write-up to describe the actions we have taken to prevent similar incidents in the future.

Incident Summary

On February 28, 2024, the Ory Network experienced a service disruption affecting multiple APIs and the Account Experience UI across the europe-west3 region. The disruption was due to an unexpected increase in CPU usage following a scheduled release, leading to service downtime.

Timeline

  1. 2024-02-28 13:34 CET: The incident began - APIs started responding with error code 500.
  2. 2024-02-28 13:36 CET: Engineers received notification.
  3. 2024-02-28 13:42 CET: Scaling up of infrastructure in the EU.
  4. 2024-02-28 13:49 CET: Login & Identity systems restored.
  5. 2024-02-28 13:53 CET: All systems fully operational.

Impact

The affected services included:

  • Login & Identities APIs
  • OAuth2 & OIDC APIs
  • Permission APIs
  • Account Experience UI

Users experienced timeouts and error responses (code 500) from the aforementioned services during the incident.

Root Cause Analysis

On each release, Ory Network re-generates the configuration for all projects. This happens asynchronously: when a service loads and uses a configuration and determines that it was generated by an older software rollout, it triggers a background re-rendering. The goal is an eventually consistent update, so that subsequent requests have an up-to-date configuration available.
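To illustrate the mechanism, here is a minimal Go sketch of the idea. The `ConfigStore` interface and all names in it are assumptions made for this post, not Ory Network internals:

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// Illustrative sketch only: the types and functions below are assumptions
// for this post, not Ory Network internals.

// Config is a rendered project configuration tagged with the software
// revision that produced it.
type Config struct {
	ProjectID string
	Revision  string
}

// ConfigStore loads cached configurations and can re-render them.
type ConfigStore interface {
	Load(ctx context.Context, projectID string) (*Config, error)
	Render(ctx context.Context, projectID string) (*Config, error)
}

// LoadConfig serves the cached configuration immediately. If it was
// generated by an older rollout, a background re-render is triggered so
// that subsequent requests eventually see an up-to-date configuration.
func LoadConfig(ctx context.Context, store ConfigStore, projectID, currentRevision string) (*Config, error) {
	cfg, err := store.Load(ctx, projectID)
	if err != nil {
		return nil, err
	}
	if cfg.Revision != currentRevision {
		// Eventually consistent: this request keeps using the valid but
		// outdated config while the re-render runs in the background.
		go func() {
			_, _ = store.Render(context.Background(), projectID)
		}()
	}
	return cfg, nil
}

// memStore is a toy in-memory store used to exercise the sketch.
type memStore struct {
	mu  sync.Mutex
	cfg Config
}

func (m *memStore) Load(_ context.Context, _ string) (*Config, error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	c := m.cfg
	return &c, nil
}

func (m *memStore) Render(_ context.Context, _ string) (*Config, error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.cfg.Revision = "v2" // pretend the new rollout re-rendered the config
	c := m.cfg
	return &c, nil
}

func main() {
	store := &memStore{cfg: Config{ProjectID: "demo", Revision: "v1"}}
	cfg, _ := LoadConfig(context.Background(), store, "demo", "v2")
	fmt.Println("served revision:", cfg.Revision) // still v1 on this request
	time.Sleep(50 * time.Millisecond)             // let the background render finish
	cfg, _ = LoadConfig(context.Background(), store, "demo", "v2")
	fmt.Println("served revision:", cfg.Revision) // v2 after the async update
}
```

The important property is that the request path never blocks on re-rendering: a request is served with the still-valid, slightly outdated configuration while the update happens in the background.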

Right during the release, an unexpectedly large number of concurrent incoming requests triggered the background rendering jobs, temporarily consuming all CPU resources of the affected Kubernetes nodes.

Our automatic horizontal scaling reacted immediately to the increased CPU usage by adding replicas of the affected services, but those replicas were scheduled onto the existing nodes rather than triggering the addition of new nodes to the cluster.

The resource consumption problem was exacerbated by a misconfigured retry policy, which re-triggered timed-out rendering requests with too little backoff.
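For comparison, here is a minimal sketch of the kind of retry policy that avoids this failure mode: bounded attempts with exponential backoff and full jitter. The function and its parameters are hypothetical and only illustrate the general technique:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// retryWithBackoff retries op with exponential backoff and full jitter.
// Spreading retries out like this prevents a burst of timed-out requests
// from hammering the backend again all at once (the thundering herd).
func retryWithBackoff(ctx context.Context, attempts int, base, max time.Duration, op func(context.Context) error) error {
	var err error
	backoff := base
	for i := 0; i < attempts; i++ {
		if err = op(ctx); err == nil {
			return nil
		}
		if i == attempts-1 {
			break // no point sleeping after the final attempt
		}
		// Full jitter: sleep a random duration in [0, backoff).
		sleep := time.Duration(rand.Int63n(int64(backoff)))
		select {
		case <-time.After(sleep):
		case <-ctx.Done():
			return ctx.Err()
		}
		if backoff *= 2; backoff > max {
			backoff = max
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	err := retryWithBackoff(context.Background(), 5, 100*time.Millisecond, 2*time.Second,
		func(context.Context) error {
			calls++
			if calls < 3 {
				return errors.New("render timed out")
			}
			return nil
		})
	fmt.Println("calls:", calls, "err:", err)
}
```

The jitter spreads the retries of many concurrent callers over time, so a wave of timeouts does not translate into a synchronized wave of retries.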

Resolution and Recovery

Our Site Reliability Engineering (SRE) team responded to the incident by adding compute resources to handle the increased load. The backlog of configuration rendering operations was then worked through rapidly, and normal operation was restored shortly thereafter.

Preventive Measures

In response to this incident, we are taking steps to prevent future occurrences. These measures include:

  • CPU-intensive operations will be moved to dedicated compute nodes running them exclusively, to prevent them from impacting the availability of other services.
  • These job servers will be fronted by an improved queueing mechanism to ensure they are not overwhelmed (see the sketch after this list).
  • The retry policy of the job queueing will be adjusted to avoid the thundering herd problem encountered during this incident.
  • Kubernetes CPU requests on all workloads will be increased, and additional compute capacity permanently added to the cluster.
  • The existing mechanism of continuing request processing with an outdated configuration revision will be re-tested, to confirm it works as designed.
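To sketch the queueing idea referenced in the list above (under assumed names, not our actual implementation): a bounded job queue with a fixed worker pool rejects excess work instead of letting an unbounded number of concurrent renders consume all CPU:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrQueueFull signals that the caller should back off and retry later
// rather than pile more work onto the render workers.
var ErrQueueFull = errors.New("render queue full")

// renderQueue bounds both the number of queued jobs and the number of
// workers executing them, so CPU-heavy rendering cannot starve other
// workloads on the same node.
type renderQueue struct {
	jobs chan func()
	wg   sync.WaitGroup
}

func newRenderQueue(workers, capacity int) *renderQueue {
	q := &renderQueue{jobs: make(chan func(), capacity)}
	for i := 0; i < workers; i++ {
		q.wg.Add(1)
		go func() {
			defer q.wg.Done()
			for job := range q.jobs {
				job()
			}
		}()
	}
	return q
}

// Enqueue rejects work when the queue is full instead of blocking the caller.
func (q *renderQueue) Enqueue(job func()) error {
	select {
	case q.jobs <- job:
		return nil
	default:
		return ErrQueueFull
	}
}

// Close stops accepting work and waits for in-flight jobs to finish.
func (q *renderQueue) Close() {
	close(q.jobs)
	q.wg.Wait()
}

func main() {
	q := newRenderQueue(4, 16) // 4 workers, at most 16 queued jobs
	for i := 0; i < 20; i++ {
		i := i
		if err := q.Enqueue(func() { fmt.Println("rendered config", i) }); err != nil {
			// Jobs beyond the queue capacity are rejected instead of queued.
			fmt.Println("rejected job", i, ":", err)
		}
	}
	q.Close()
}
```

Rejected jobs can then be retried with a well-behaved backoff policy, which keeps the dedicated render nodes busy but never overwhelmed.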

We are committed to maintaining the reliability and performance of the Ory Network and will continue to update our processes and infrastructure based on the lessons learned from this incident.

Conclusion

We apologize for any inconvenience caused by this outage and appreciate your understanding as we work to enhance our systems. Our priority remains to provide reliable and secure services to all our users.