Post-Mortem: Outage on 2023-07-18

Ferdynand Naczyński -

Time to read: 1 min

Incident Report: Deployment deadlock

Dear Customers and Users,

We’d like to share a post-mortem about the event on July 18th, related to Production release that failed to deploy seamlessly. We are sincerely sorry for the inconvenience and disruption, and share this write-up to describe actions we have taken to avoid this failure mode in future.

Incident Summary

Incident Start: 2023.07.18 13:50 CEST

Incident End: 2023.07.18 14:06 CEST

Services Affected: Ory network APIs, Ory network console, Ory network account experience

On 2023.07.18, an incident was triggered by a misconfiguration that was unnoticeably pushed to our git repository. Our infrastructure leverages gitops workflow and ArgoCD to automatically deploy changes, which picked up the faulty configuration. The system has run preliminary operations, rendering old application versions inoperable, while new one could not be deployed. This has resulted in a downtime of the platform.

Root Cause Analysis

The reason of the outage was a missing field in Kubernetes deployment manifest. As the manifest did not follow specification, it could not be deployed.

Resolution and Recovery

Following release engineers received alerts and identified the issue. The faulty deployment manifest has been corrected and released, allowing ArgoCD to deploy latest application version.

Preventive Measures

To prevent this failure mode in future, we have corrected pre-release checks to detect any manifests not meeting their specifications. This safeguards against deployment of incomplete and incorrect infrastructure configuration.