Ory has ambitious goals for Ory Cloud, and we heavily invest in the reliability of our services. However, we suffered an outage of our identity services APIs on 2022/07/30 from 02:44:15 to 10:53:11.
In short, several smaller problems combined. A minor version update of one of our database clusters left our identity services holding unhealthy database connections, because their connection pool did not remove them. Together with a faulty health check configuration of our identity services and a bug in our monitoring and notification system, this caused an outage of our identity services and a severely delayed response by our SRE team.
We added automation to ensure that such an outage will not occur again.
A minor version upgrade on one of our database clusters starts.
Due to the restarting database nodes, database connections in the identity service become unhealthy but are not removed from the connection pool. Due to a misconfiguration of the health check, the affected service instances are flagged as "not ready" but not as "unhealthy". This prevents them from being restarted.
A misconfiguration in the Identity Service health check prevents a notification about this problem.
Our external monitoring system notices that the Identity Management APIs are becoming unresponsive and updates our status page. But since the instances are not marked as unhealthy, the on-call chain does not trigger.
Our external monitoring system notices that the Session Check API is becoming unresponsive and updates our status page. Since the instances are not marked as unhealthy, the on-call chain does not trigger.
Our SRE team acknowledges the problem in Slack and starts the investigation.
We determine that the Identity Services database connections are not healthy.
We trigger a restart of all Identity Management service instances.
All restarted instances are able to connect to the database cluster and start handling requests again.
Our external monitoring system detects that the Session Check API has fully recovered.
Our external monitoring system detects that the Identity Management APIs have fully recovered. The APIs are fully restored 13 minutes after the SRE team started the investigation.
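The core of the failure mode in the timeline above is that instances stayed "not ready" indefinitely without ever being marked "unhealthy", so nothing restarted them. The following is a minimal illustrative sketch, not Ory's actual configuration or code. It assumes two probes in the style of the readiness and liveness checks offered by common orchestrators, and a hypothetical threshold `max_unready_seconds` after which a persistently unready instance also fails its liveness check. All names are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProbeState:
    """Hypothetical per-instance probe state; names are illustrative only."""

    max_unready_seconds: float  # assumed escalation threshold
    unready_since: Optional[float] = None

    def readiness(self, db_ok: bool, now: float) -> bool:
        """Readiness: should this instance receive traffic right now?"""
        if db_ok:
            self.unready_since = None  # recovered, reset the timer
        elif self.unready_since is None:
            self.unready_since = now  # first failed check, start the timer
        return db_ok

    def liveness(self, now: float) -> bool:
        """Liveness: returning False asks the orchestrator for a restart."""
        if self.unready_since is None:
            return True
        # Escalate: an instance that stays unready too long counts as unhealthy.
        return (now - self.unready_since) < self.max_unready_seconds
```

With `max_unready_seconds=60`, an instance that first fails readiness at `now=0.0` still passes liveness at `now=30.0` but fails it at `now=90.0`, which would allow an automatic restart instead of waiting for manual intervention.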
In response to these events, we updated the health check configuration of our Identity Services to better recognize unhealthy instances and restart them automatically. We are updating the database pool management to remove unhealthy database connections. We updated the external monitoring system to always trigger the on-call chain on any outage or degradation of any API. We performed extensive tests validating that, with these fixes in place, Ory’s services automatically recover unhealthy nodes and the internal monitoring correctly triggers the on-call chain on outages.
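One common way to remove unhealthy connections from a pool is to validate each idle connection before handing it out. The sketch below shows that idea with a toy pool; it is an assumption about the general technique, not a description of the pool our services use. `ValidatingPool`, `connect`, and `is_healthy` are hypothetical names.

```python
from collections import deque
from typing import Callable, Deque, Generic, TypeVar

C = TypeVar("C")


class ValidatingPool(Generic[C]):
    """Toy pool that checks each idle connection before handing it out."""

    def __init__(self, connect: Callable[[], C], is_healthy: Callable[[C], bool]):
        self._connect = connect  # hypothetical factory for new connections
        self._is_healthy = is_healthy  # hypothetical check, e.g. a ping query
        self._idle: Deque[C] = deque()
        self.evicted = 0  # number of broken connections dropped so far

    def acquire(self) -> C:
        """Return a healthy idle connection, or open a new one."""
        while self._idle:
            conn = self._idle.popleft()
            if self._is_healthy(conn):
                return conn
            self.evicted += 1  # drop the broken connection instead of reusing it
        return self._connect()

    def release(self, conn: C) -> None:
        """Return a connection to the idle set for later reuse."""
        self._idle.append(conn)
```

After a database node restarts, the next `acquire()` call discards the stale connection and opens a fresh one, so the service recovers without an instance restart. The trade-off is one extra health check per checkout.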
With these remediations in place, we are confident that an event like this will not happen again.
We are working on or planning several additional initiatives to harden our infrastructure. We are setting up additional checkpoints that notify us if any of the monitoring and notification automation integrations stop functioning. We are fast-tracking Chaos Monkey testing to start next month.
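A checkpoint that detects a silent monitoring integration is often built as a heartbeat watchdog (sometimes called a dead man's switch): each integration reports in periodically, and a missing report is itself treated as an alert. The following is a minimal sketch of that pattern under assumed names (`HeartbeatWatchdog`, `max_gap_seconds`, and the integration labels are hypothetical); it does not describe the tooling we are deploying.

```python
from typing import Dict, List


class HeartbeatWatchdog:
    """Flags integrations whose last heartbeat is older than the allowed gap."""

    def __init__(self, max_gap_seconds: float):
        self.max_gap_seconds = max_gap_seconds  # assumed tolerance between beats
        self._last_seen: Dict[str, float] = {}

    def beat(self, integration: str, now: float) -> None:
        """Record that an integration reported in at time `now`."""
        self._last_seen[integration] = now

    def silent(self, now: float) -> List[str]:
        """Return the integrations that have been quiet for too long."""
        return sorted(
            name
            for name, seen in self._last_seen.items()
            if now - seen > self.max_gap_seconds
        )
```

For example, with a 300-second gap, an integration that last reported at `0.0` appears in `silent(now=450.0)`, while one that reported at `400.0` does not.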