Post-Mortem: Partial outage on 2022-08-11

Andreas Bucksteeg - Last updated at August 08, 2022

Summary

Ory has ambitious goals for Ory Cloud, and we invest heavily in the reliability of our services. Nevertheless, on 2022/08/11 from 09:00 to 19:00 we suffered a partial outage of a subgroup of APIs that affected a small group of our users. The issue affected the management of Ory Cloud projects. API calls and operations that serve our customers' own customers were not impacted and remained fully operational.

The issue was caused by an overly restrictive configuration of the Web Application Firewall (WAF, ModSecurity): the request payload of certain API calls exceeded the WAF's configured limit. The search for the cause was prolonged because our Ingress Controller, which integrates with the WAF, did not handle the error transparently and reported http_response_code 200 for the failed requests. We filed a bug report with the Ingress vendor. In addition, because the outage affected so few requests, our monitoring picked up only sporadic data points about it, which further delayed pinpointing the problem.
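
A failed request that is reported with response code 200 cannot be caught by status-code monitoring alone. As a minimal sketch of a check that also inspects the response body, the following Python function treats a response as successful only if it has a 2xx status and a JSON object containing the expected keys. The function name, the `required_keys` parameter, and the assumption that a blocked request yields an empty or non-JSON body are all hypothetical; the actual response format and Ory's monitoring rules are not described here.

```python
import json


def looks_successful(status_code, body, required_keys=("id",)):
    """Return True only for a 2xx status AND a JSON object with the expected keys.

    Hypothetical check: `required_keys` is an illustrative assumption about
    what a healthy API response contains, not a documented contract.
    """
    if not 200 <= status_code < 300:
        return False
    try:
        payload = json.loads(body)
    except (json.JSONDecodeError, TypeError):
        # Empty, non-JSON, or missing body: treat as a masked failure.
        return False
    if not isinstance(payload, dict):
        return False
    return all(key in payload for key in required_keys)
```

A probe built on such a check would flag a response with status 200 and an empty or unexpected body as a failure instead of counting it as healthy traffic.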

Timeline

[2022/08/11 09:00]
Users reported problems with accessing Ory Console.

[2022/08/11 09:30]
Investigation started. Since only two users had reported the problem, we suspected that it was caused by the clients' environment: a browser extension, firewall, or network configuration. We were not able to reproduce it with our own accounts.

[2022/08/11 15:40]
Another user reported the same problem and provided request details. That allowed us to trigger the failing request on demand with curl. At this point, the investigation focused on the WAF and rate limiting.
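
Once a failing request can be triggered on demand, one way to narrow down a payload-size limit is to bisect the request size. The sketch below is a hypothetical illustration, not the procedure used during this incident: `accepts` stands in for any callable that sends a request with a body of the given size and reports whether it truly succeeded, and the search assumes that every size up to the limit is accepted and every larger size is rejected.

```python
from typing import Callable


def largest_accepted_size(accepts: Callable[[int], bool], low: int, high: int) -> int:
    """Binary-search the largest payload size in [low, high] that is accepted.

    Assumes acceptance is monotonic: sizes up to some limit pass, larger
    sizes fail. Returns low - 1 if no size in the range is accepted.
    """
    best = low - 1
    while low <= high:
        mid = (low + high) // 2
        if accepts(mid):
            best = mid       # mid passed; the limit is at least mid
            low = mid + 1
        else:
            high = mid - 1   # mid failed; the limit is below mid
    return best


# Usage with a stand-in for the real request; the 131072-byte limit is made up.
fake_limit = 131072
found = largest_accepted_size(lambda size: size <= fake_limit, 1, 1_000_000)
```

In practice `accepts` would need to judge success by the response content rather than the status code, since the failed requests in this incident were reported as 200.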

[2022/08/11 19:20]
Issue mitigated by rolling out a hotfix to the production environment.

Remediations

We reconfigured our Web Application Firewall and modified our monitoring rules to avoid this kind of issue in the future. We are also in contact with our WAF vendor to verify whether this class of problems can be mitigated, so that such an outage does not occur again.