Incident Report: Deployment deadlock
Dear Customers and Users,
We’d like to share a post-mortem about the event on July 18th, related to Production release that failed to deploy seamlessly. We are sincerely sorry for the inconvenience and disruption, and share this write-up to describe actions we have taken to avoid this failure mode in future.
Incident Summary
Incident Start: 2023.07.18 13:50 CEST
Incident End: 2023.07.18 14:06 CEST
Services Affected: Ory network APIs, Ory network console, Ory network account experience
On 2023.07.18, an incident was triggered by a misconfiguration that was unnoticeably pushed to our git repository. Our infrastructure leverages gitops workflow and ArgoCD to automatically deploy changes, which picked up the faulty configuration. The system has run preliminary operations, rendering old application versions inoperable, while new one could not be deployed. This has resulted in a downtime of the platform.
Root Cause Analysis
The reason of the outage was a missing field in Kubernetes deployment manifest. As the manifest did not follow specification, it could not be deployed.
Resolution and Recovery
Following release engineers received alerts and identified the issue. The faulty deployment manifest has been corrected and released, allowing ArgoCD to deploy latest application version.
Preventive Measures
To prevent this failure mode in future, we have corrected pre-release checks to detect any manifests not meeting their specifications. This safeguards against deployment of incomplete and incorrect infrastructure configuration.