Post-Mortem: Ory Network Outage caused by a massive DDOS attack
A post-mortem for the Ory Network Outage caused by a massive DDOS attack on 30/10/2023.
Incident Report: DDOS Attack
Dear Customers and Users,
We’d like to share a post-mortem of the incident on October 30th. We are sincerely sorry for the inconvenience it caused, and we are sharing this write-up to describe the actions we have taken to prevent such an incident in the future.
Incident Summary
Incident 1: 30.10.2023 13:43 CET to 30.10.2023 14:42 CET
Incident 2: 30.10.2023 16:03 CET to 30.10.2023 17:19 CET
Services affected:
- Login & Identities APIs
- OAuth2 & OIDC APIs
- Account Experience UI
At 13:43 CET on October 30, 2023, our SRE team was alerted to multiple DDOS attacks. Our security solution's automated response blocked only a limited number of the malicious requests, and the overwhelming volume of remaining requests led to increased service response times between 13:47 CET and 16:03 CET. After analyzing the attack, the SRE team modified the DDOS protection settings and created WAF rules to counter it. In addition to the changes in the DDOS protection setup, the team adjusted the auto-scaling configuration and node types. By 17:19 CET, Ory Network was fully operational again.
Root Cause Analysis
The existing configuration of the DDOS prevention and rate-limiting systems was inadequate for reliably detecting the DDOS requests. Initially, this caused a rise in service latency and subsequently overloaded the services. Auto-scaling responded by increasing the number of service instances, which resulted in a brief service recovery. Nevertheless, the overwhelming volume of requests caused latency to surge once more, ultimately congesting all services.
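To illustrate the kind of rate limiting discussed above, here is a minimal sketch of a per-client token bucket. It is a generic textbook technique, not a description of Ory Network's actual rate-limiting system; the class names, the `capacity` and `rate` parameters, and the client identifiers are all assumptions made for the example.

```python
from typing import Dict


class TokenBucket:
    """Allows bursts of up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, capacity: float, rate: float, now: float) -> None:
        self.capacity = capacity
        self.rate = rate
        self.tokens = capacity  # start with a full bucket
        self.last = now         # time of the last refill

    def allow(self, now: float) -> bool:
        # Refill in proportion to the elapsed time, capped at capacity.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        # Spend one token per request; reject when the bucket is empty.
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class RateLimiter:
    """Keeps one bucket per client (hypothetical key, e.g. a source IP)."""

    def __init__(self, capacity: float, rate: float) -> None:
        self.capacity = capacity
        self.rate = rate
        self.buckets: Dict[str, TokenBucket] = {}

    def allow(self, client_id: str, now: float) -> bool:
        bucket = self.buckets.get(client_id)
        if bucket is None:
            bucket = TokenBucket(self.capacity, self.rate, now)
            self.buckets[client_id] = bucket
        return bucket.allow(now)


# Timestamps are passed in explicitly so the behaviour is deterministic.
limiter = RateLimiter(capacity=3, rate=1.0)
burst = [limiter.allow("client-a", now=0.0) for _ in range(5)]
```

In this sketch a single client that sends five requests at the same instant gets the first three through and the rest rejected, while other clients are unaffected. A limiter keyed on individual clients is also a plausible illustration of why such a configuration can struggle against a distributed attack: when the traffic is spread across many sources, each one may stay below its own limit.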
Resolution and Recovery
Our Site Reliability Engineering (SRE) team studied the attack method and traffic, then refined both the infrastructure configuration and the WAF/DDOS configurations to increase available bandwidth and detection sensitivity. Customized WAF rules were created to block the DDOS attack by targeting its key characteristics.
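As a rough sketch of what "targeting key attack characteristics" can look like, the example below evaluates a request against a list of blocking rules. The attack's real characteristics and the actual WAF rules are not disclosed in this report, so every field (`path`, `user_agent`), rule name, and matching condition here is hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Request:
    # Hypothetical request attributes a rule might inspect.
    path: str
    user_agent: str


@dataclass(frozen=True)
class Rule:
    name: str
    matches: Callable[[Request], bool]


def first_blocking_rule(request: Request, rules: List[Rule]) -> Optional[str]:
    """Return the name of the first rule that matches, or None if the request may pass."""
    for rule in rules:
        if rule.matches(request):
            return rule.name
    return None


# Illustrative rules only; they do not reflect the rules deployed during the incident.
rules = [
    Rule("empty-user-agent", lambda r: r.user_agent.strip() == ""),
    Rule(
        "scripted-client-on-login",
        lambda r: r.path.startswith("/login") and "bot" in r.user_agent.lower(),
    ),
]

blocked_by = first_blocking_rule(Request("/login", "ExampleBot/1.0"), rules)
passed = first_blocking_rule(Request("/login", "Mozilla/5.0"), rules)
```

Here `blocked_by` is the name of the matching rule and `passed` is `None`. Keeping each rule as a named predicate makes it straightforward to add, remove, or audit individual rules as an attack's traffic pattern becomes clearer.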
Preventive Measures
- Redefined infrastructure size and setup to provide more bandwidth for services (implemented)
- Redefined infrastructure auto-scaling mechanism (implemented)
- Additional traffic blocking rules (implemented)