Post-Mortem: Degraded Service on 2022-12-13

Patrik Neu - Last updated on December 19, 2022

Time to read: 1 min

Summary

Ory has ambitious goals for Ory Network, and we heavily invest in the reliability of our services. However, the platform experienced elevated latency on 2022/12/13 from 10:13 to 10:21. The issue affected all API calls/operations that hit the database at some point.

The issue was caused by resource exhaustion in our database system. The root cause was a set of missing indexes, which made identity deletion queries consume a lot of memory. Specifically, the cascading queries triggered by each deletion caused full-table scans. During the incident, one customer ran many of these expensive queries, which degraded performance for the whole system.
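To illustrate the root cause, here is a minimal sketch of how a missing index on a referencing column turns the child-row lookup of a cascading delete into a full-table scan. It uses Python's built-in sqlite3 module purely as a stand-in; it is not our production database, and the table, column, and index names (identities, credentials, identity_id, credentials_identity_id_idx) are hypothetical. The lookup a cascade has to perform is modeled as a plain SELECT on the referencing column.

```python
import sqlite3


def child_lookup_plan(with_index: bool) -> str:
    """Return SQLite's query plan for finding the child rows of one identity.

    A cascading delete has to find every child row that references the
    deleted parent. This models that lookup as a SELECT on the referencing
    column. All table and index names here are hypothetical.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE identities (id INTEGER PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE credentials ("
        " id INTEGER PRIMARY KEY,"
        " identity_id INTEGER NOT NULL"
        " REFERENCES identities(id) ON DELETE CASCADE)"
    )
    if with_index:
        # The fix: index the column that the cascade has to search.
        conn.execute(
            "CREATE INDEX credentials_identity_id_idx"
            " ON credentials (identity_id)"
        )
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT id FROM credentials WHERE identity_id = ?",
        (1,),
    ).fetchall()
    conn.close()
    # The last column of each plan row holds the human-readable detail.
    return " | ".join(row[3] for row in rows)


if __name__ == "__main__":
    print("without index:", child_lookup_plan(with_index=False))
    print("with index:   ", child_lookup_plan(with_index=True))
```

Without the index, the plan reports a scan of the whole child table; with it, the plan reports an index search. The exact wording of the plan text varies between SQLite versions, and other database systems expose the same information through their own EXPLAIN output.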

Timeline (CET/UTC+1)

[2022/12/13 9:25]
A customer starts deleting multiple identities in a row.

[2022/12/13 9:45]
We observe high latency in our production system and start investigating.

[2022/12/13 10:29]
Our investigation of traffic and log statements does not reveal any relevant information. We now focus on SQL queries and database performance.

[2022/12/13 10:35]
The customer stops deleting identities for ~15 minutes. During that time, latency returns to normal.

[2022/12/13 11:30]
After confirming with our database vendor that we can scale nodes vertically without downtime, we increase the resources for short-term mitigation.

[2022/12/13 11:40]
We identify the implicit cascading queries that are causing full-table scans. We create the missing indexes on the affected columns in our staging system and verify the fix.

[2022/12/13 11:45]
The database nodes are scaled up and the latency drops significantly to acceptable levels.

[2022/12/13 11:50]
We apply the fix to the production system, thereby eliminating the root cause. We continue to closely monitor the system. No further issues surface.

Remediations

We created the missing indexes to prevent the excessive memory consumption of the identity delete operation. In addition, we created alerts based on service latency so that we detect similar incidents earlier in the future.
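As a rough illustration of what a latency-based alert does, the sketch below keeps a sliding time window of request latencies and fires when a high percentile exceeds a threshold. It is not our actual alerting setup: the class name LatencyAlert, the threshold, the window length, the percentile, and the minimum sample count are all hypothetical values chosen for the example. In practice such a rule would usually live in a monitoring system rather than in application code.

```python
import math
from collections import deque
from typing import Deque, Tuple


class LatencyAlert:
    """Fire when a latency quantile over a sliding time window is too high.

    All parameters are illustrative assumptions, not production values.
    """

    def __init__(
        self,
        threshold_ms: float,
        window_s: float,
        quantile: float = 0.95,
        min_samples: int = 20,
    ) -> None:
        self.threshold_ms = threshold_ms
        self.window_s = window_s
        self.quantile = quantile
        self.min_samples = min_samples
        self._samples: Deque[Tuple[float, float]] = deque()

    def observe(self, timestamp_s: float, latency_ms: float) -> bool:
        """Record one request and return True if the alert should fire."""
        self._samples.append((timestamp_s, latency_ms))

        # Drop samples that have fallen out of the time window.
        cutoff = timestamp_s - self.window_s
        while self._samples and self._samples[0][0] < cutoff:
            self._samples.popleft()

        # Avoid alerting on too little data.
        if len(self._samples) < self.min_samples:
            return False

        # Nearest-rank quantile of the latencies currently in the window.
        ordered = sorted(latency for _, latency in self._samples)
        rank = max(1, math.ceil(self.quantile * len(ordered)))
        return ordered[rank - 1] > self.threshold_ms


if __name__ == "__main__":
    alert = LatencyAlert(threshold_ms=500, window_s=60, min_samples=5)
    for t in range(10):
        alert.observe(float(t), 50.0)  # normal traffic
    print("fires on slow request:", alert.observe(10.0, 900.0))
```

Alerting on a high percentile rather than the mean is one common design choice, because a few very slow requests show up in the tail long before they move the average.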