Summary
Ory has ambitious goals for Ory Cloud, and we heavily invest in the reliability of our services. However, we suffered an outage of our identity services APIs on 2022/07/30 from 02:44:15 to 10:53:11.
In short, several smaller problems combined. A minor version update of one of our database clusters restarted its nodes, and the connection pool of our identity services did not remove the resulting unhealthy connections. A faulty health check configuration then kept the affected service instances from being restarted, and a bug in our monitoring and notification system severely delayed the response by our SRE team.
We added automation to ensure that such an outage will not occur again.
Timeline
[2022/07/30 02:41]
A minor version upgrade on one of our database
clusters starts.
[2022/07/30 02:43]
Due to the restarting database nodes, database
connections in the identity service become unhealthy but are not removed from
the connection pool. Due to a misconfiguration of the health check, the affected
service instances are flagged as "not ready" but not as "unhealthy". This
prevents them from being restarted.
A misconfiguration in the Identity Service health check prevents a notification about this problem.
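The difference between "not ready" and "unhealthy" can be illustrated with a minimal sketch. It assumes an orchestrator that stops routing traffic to instances that are not ready but restarts only instances that report as not alive. The names `HealthChecker`, `ProbeResult`, `check_db`, and `max_not_ready_seconds` are hypothetical and are not taken from Ory's code base. The idea shown is to escalate an instance that stays "not ready" for too long to "unhealthy" so that it gets restarted.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ProbeResult:
    ready: bool  # may this instance receive traffic?
    alive: bool  # False asks the orchestrator to restart the instance


class HealthChecker:
    """Hypothetical checker that escalates a long 'not ready' state to 'unhealthy'."""

    def __init__(self, check_db: Callable[[], bool],
                 max_not_ready_seconds: float = 60.0) -> None:
        self._check_db = check_db  # assumed callable: True if the database answers
        self._max_not_ready = max_not_ready_seconds
        self._not_ready_since: Optional[float] = None

    def probe(self, now: float) -> ProbeResult:
        if self._check_db():
            # Database reachable: clear any running "not ready" timer.
            self._not_ready_since = None
            return ProbeResult(ready=True, alive=True)
        if self._not_ready_since is None:
            self._not_ready_since = now
        stuck_for = now - self._not_ready_since
        # Short interruptions only take the instance out of rotation.
        # Once the grace period has elapsed, report it as not alive.
        return ProbeResult(ready=False, alive=stuck_for < self._max_not_ready)
```

With a grace period of 60 seconds, an instance whose database check keeps failing is reported as not ready at first, and as not alive once 60 seconds have passed.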
[2022/07/30 02:44]
Our external monitoring system notices that the
Identity Management APIs are becoming unresponsive and updates our status page.
But since the instances are not marked as unhealthy, the on-call chain does not
trigger.
[2022/07/30 02:45]
Our external monitoring system notices that the Session
Check API is becoming unresponsive and updates our status page. Since the
instances are not marked as unhealthy, the on-call chain does not trigger.
[2022/07/30 10:40]
Our SRE team acknowledges the problem in Slack and
starts the investigation.
[2022/07/30 10:44]
We determine that the Identity Services database
connections are not healthy.
[2022/07/30 10:45]
We trigger a restart of all Identity Management
service instances.
[2022/07/30 10:47]
All restarted instances are able to connect to the
database cluster and start handling requests again.
[2022/07/30 10:50]
Our external monitoring system detects that the
Session Check API has fully recovered.
[2022/07/30 10:53]
Our external monitoring system detects that the
Identity Management APIs have fully recovered. 13 minutes after the SRE team
started the investigation, the Identity Management APIs are fully restored.
Remediations
In response to these events, we updated the health check configuration of our Identity Services to recognize unhealthy instances better and restart them automatically. We are updating the database pool management to remove unhealthy database connections. We updated the external monitoring system to always trigger the on-call chain on any outage or degradation of any API. We performed extensive tests validating that with these fixes in place, Ory’s services automatically recover unhealthy nodes, and the internal monitoring is triggering the on-call chain correctly on outages.
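As an illustration of removing unhealthy connections from a pool, the following is a minimal sketch and not Ory's actual implementation. It assumes a connection object with `ping()` and `close()` methods. `ValidatingPool` and `connect` are hypothetical names, and the pool is kept deliberately simple: it is not thread-safe and does not cap the number of connections. Each idle connection is validated when it is checked out. A connection that fails the ping is closed and discarded instead of being handed to the caller.

```python
from collections import deque
from typing import Callable, Deque, Protocol


class Connection(Protocol):
    """Assumed minimal connection interface for this sketch."""

    def ping(self) -> bool: ...
    def close(self) -> None: ...


class ValidatingPool:
    """Hypothetical pool that evicts unhealthy connections on checkout."""

    def __init__(self, connect: Callable[[], Connection], size: int) -> None:
        self._connect = connect
        self._idle: Deque[Connection] = deque(connect() for _ in range(size))

    def acquire(self) -> Connection:
        while self._idle:
            conn = self._idle.pop()
            if self._is_healthy(conn):
                return conn
            conn.close()  # evict: never hand out a broken connection
        # No healthy idle connection left, so open a fresh one.
        return self._connect()

    def release(self, conn: Connection) -> None:
        self._idle.append(conn)

    @staticmethod
    def _is_healthy(conn: Connection) -> bool:
        try:
            return bool(conn.ping())
        except Exception:
            # Any error during the ping counts as unhealthy.
            return False
```

Validating on checkout adds one round trip per acquisition. Real pools often combine this with a maximum connection lifetime or background validation to reduce that cost.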
With these remediations in place, we are confident that an event like this will not happen again.
We are working on or planning several additional initiatives to harden our infrastructure. We are setting up additional checkpoints that notify us if any of the monitoring and notification automation integrations don't function. We are fast-tracking Chaos Monkey testing to start next month.
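One common way to detect a silently broken monitoring or notification integration is a heartbeat check, sometimes called a dead man's switch. The sketch below is an assumption about how such a checkpoint could look and does not describe Ory's setup. `HeartbeatWatchdog` and the integration names are hypothetical. Each integration reports in periodically, and any integration that stays silent for longer than the allowed window is flagged.

```python
from typing import Dict, List


class HeartbeatWatchdog:
    """Hypothetical watchdog that flags integrations which stop reporting."""

    def __init__(self, max_silence_seconds: float) -> None:
        self._max_silence = max_silence_seconds
        self._last_seen: Dict[str, float] = {}

    def heartbeat(self, integration: str, now: float) -> None:
        # Record the time at which the integration last proved it works.
        self._last_seen[integration] = now

    def silent_integrations(self, now: float) -> List[str]:
        # Integrations whose last heartbeat is older than the allowed window.
        return sorted(
            name
            for name, seen in self._last_seen.items()
            if now - seen > self._max_silence
        )


watchdog = HeartbeatWatchdog(max_silence_seconds=300)
watchdog.heartbeat("status-page", now=0)
watchdog.heartbeat("on-call-pager", now=0)
watchdog.heartbeat("status-page", now=400)
overdue = watchdog.silent_integrations(now=500)  # only "on-call-pager" is overdue
```

For this to be useful, the watchdog has to notify through a path that does not depend on the integrations it watches.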