Summary
Ory has ambitious goals for Ory Cloud, and we invest heavily in the reliability of our services. However, on 2022/08/11, from roughly 09:00 to 19:00, a partial outage of a subset of our APIs affected a small group of our users. The outage affected the management of Ory Cloud Projects. API calls and operations that serve our customers' own customers were not impacted and remained fully operational.
The issue was caused by an overly restrictive configuration of our Web Application Firewall (WAF), ModSecurity: the request payload of certain API calls exceeded the limit configured in the WAF. The search for the cause was prolonged because our Ingress Controller, which integrates with the WAF, did not handle the error transparently and displayed http_response_code 200 for the failed requests. We filed a bug report with the Ingress vendor. In addition, because the outage had a limited effect, our monitoring picked up only sporadic data points about it, which made the problem harder to pinpoint.
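The relationship between payload size and a body limit can be illustrated with a short sketch. It is illustrative only: the limit value, the function names, and the sample payloads are assumptions, since this report does not state the actual value configured in the WAF.

```python
import json

# Hypothetical limit in bytes. The value actually configured in the WAF
# is not stated in this report.
ASSUMED_BODY_LIMIT_BYTES = 128 * 1024


def body_size_bytes(payload: dict) -> int:
    """Return the size of the payload when serialized as a compact JSON body."""
    return len(json.dumps(payload, separators=(",", ":")).encode("utf-8"))


def exceeds_limit(payload: dict, limit: int = ASSUMED_BODY_LIMIT_BYTES) -> bool:
    """Return True when the serialized payload is larger than the limit."""
    return body_size_bytes(payload) > limit


# Made-up payloads: a small one and one padded well past the assumed limit.
small = {"name": "demo-project"}
large = {"name": "demo-project", "config": "x" * (200 * 1024)}

print(exceeds_limit(small), exceeds_limit(large))
```

Under these assumptions, only requests whose bodies cross the limit are rejected, which is consistent with a failure that is visible to some users and not to others.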
Timeline
[2022/08/11 09:00]
Users reported problems accessing Ory Console.
[2022/08/11 09:30]
Investigation started. Since only two users had reported
the problem, we suspected that it was caused by the client's environment:
a browser extension, firewall, or network configuration. We were not able to
reproduce it on our own accounts.
[2022/08/11 15:40]
Another user reported the same problem and provided
request details. That allowed us to trigger the failing request on demand with
curl. At this point, the investigation focused on the WAF and rate limiting.
[2022/08/11 19:20]
Issue mitigated by rolling out a hotfix to the
production environment.
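Once a failing request can be triggered on demand, as in the 15:40 entry, one way to confirm a size-related limit is to bisect on the body size. The sketch below is not the procedure used during the incident. The probe function is a hypothetical stand-in for a real HTTP call, and the threshold it simulates is made up.

```python
from typing import Callable


def find_failure_threshold(probe: Callable[[int], bool], low: int, high: int) -> int:
    """Return the smallest body size in (low, high] for which probe fails.

    probe(size) must return True when a request with a body of `size`
    bytes succeeds. The search assumes that probe(low) succeeds, that
    probe(high) fails, and that every size above the first failing
    size also fails.
    """
    if not probe(low) or probe(high):
        raise ValueError("probe(low) must succeed and probe(high) must fail")
    while high - low > 1:
        mid = (low + high) // 2
        if probe(mid):
            low = mid   # mid still succeeds, so the threshold is above it
        else:
            high = mid  # mid fails, so the threshold is at or below it
    return high


def fake_probe(size: int) -> bool:
    """Hypothetical stand-in for an HTTP request; simulates a 128 KiB limit."""
    return size <= 131072


threshold = find_failure_threshold(fake_probe, 1, 1_000_000)
print(threshold)
```

In a real investigation the probe would send a request with a body of the given size and decide success from the response the client actually receives, because the status code logged by an intermediate component may be misleading.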
Remediations
We reconfigured our Web Application Firewall and modified our monitoring rules to avoid this kind of issue in the future. We are also in contact with our WAF vendor to verify whether this class of problem can be mitigated, so that such an outage does not occur again.