Post-Mortem: Cache Key Corruption on 2022-12-20

Aeneas Rekkas - Last updated on December 20, 2022

A multi-cookie request managed to cause a cache key corruption on our end:

  • When a user A was logged in across two or more projects at the same time, two or more cookies were being sent to Ory’s API in a single request. This caused a cache key corruption due to an incorrect JavaScript assertion.
  • Due to a code bug, a wrong cache key was calculated, and user A’s session was cached as the “unauthorized” (anonymous) default response.
  • That request was served by Cloudflare Workers. The specific regional cache worker had its cache poisoned. We assume a ratio of more than 10 workers per data center location (e.g. Frankfurt).
  • A separate HTTP client, either without any session cookie or also with two or more session cookies, then hit the /sessions/whoami endpoint on the exact same Cloudflare Worker that had the poisoned cache.
  • Subsequently, anonymous requests hitting the same Cloudflare edge worker would receive user A’s session.
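
The sequence above can be reproduced with a toy model of a single worker cache. This is a minimal sketch and not Ory's worker code: the `ory_session_` prefix, the key format, the session store, and all function names are assumptions made for illustration. It only shows how a lookup that fails for multi-cookie requests can file an authenticated response under the anonymous key.

```javascript
// Toy model of a single edge worker's cache. Every name here is hypothetical.
const SESSION_PREFIX = "ory_session_"
const ANONYMOUS_KEY = "whoami:anonymous"

// Sessions known to the stand-in backend.
const SESSIONS = { s1: "user A" }

// Mirrors the faulty lookup: an array of names is used as a property key,
// so a credential is only found when exactly one session cookie is present.
function buggySessionCredential(cookies) {
  const names = Object.keys(cookies).filter((k) => k.startsWith(SESSION_PREFIX))
  const value = cookies[names]
  return value != null ? value : null
}

// A request without a detected credential falls back to the anonymous key.
function cacheKeyFor(cookies) {
  const credential = buggySessionCredential(cookies)
  return credential === null ? ANONYMOUS_KEY : "whoami:" + credential
}

// Stand-in backend that inspects every session cookie correctly.
function originWhoami(cookies) {
  for (const [name, value] of Object.entries(cookies)) {
    if (name.startsWith(SESSION_PREFIX) && SESSIONS[value]) {
      return { status: 200, user: SESSIONS[value] }
    }
  }
  return { status: 401, user: null }
}

// Serve from the worker cache, filling it from the backend on a miss.
function handle(cache, cookies) {
  const key = cacheKeyFor(cookies)
  if (!cache.has(key)) cache.set(key, originWhoami(cookies))
  return cache.get(key)
}

const workerCache = new Map()
// User A is logged in to two projects, so two session cookies are sent.
const userA = handle(workerCache, { ory_session_prod: "s1", ory_session_staging: "s2" })
// An anonymous client hits the same worker and receives user A's session.
const anonymous = handle(workerCache, {})
```

In this model, the same anonymous request against a fresh `Map` gets the 401 response, which matches the observation that only workers with a poisoned cache were affected.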

The cache key corruption also affected projects that did not have the cache feature flag enabled. This happened because we had rolled out a feature to catch unauthorized sessions at the edge and prevent them from hitting Ory’s backend network. We did this because of a massive traffic influx that occurred when a customer rolled out an incorrect retry mechanism in their code base, hitting Ory servers at 100 RPS.

Timeline

  1. 2022-12-20 15:28 CET: Ory pushes a configuration change to production
  2. 2022-12-20 18:46 CET: Customer makes Ory aware that there is an issue
  3. 2022-12-20 18:48 CET: Patrik Neu acknowledges the issue as first-responder
  4. 2022-12-20 18:58 CET: The Ory team jumps on a video call to identify the issue
  5. 2022-12-20 18:59 CET: The configuration change is rolled back
  6. 2022-12-20 19:00 CET: The customer confirms that the issue is gone
  7. 2022-12-20 19:01 CET: The investigation of the root cause starts in collaboration with the customer
  8. 2022-12-20 20:10 CET: The Ory team has identified what caused the incident

Code bug

export async function extractSession(request) {
  // Is it a cookie?
  const cookie = parse(request.headers.get("Cookie") || "")
  const cookie_name = Object.keys(cookie).filter((k) =>
    k.startsWith(COOKIE_PREFIX),
  )

  // Explanation:
  //
  // cookie_name is always an array (the result of filter) and not a string.
  // This only becomes visible when more than one cookie is found.

  if (request.headers.has("X-Session-Cookie")) {
    return request.headers.get("X-Session-Cookie")
  } else if (cookie[cookie_name] != null) {
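    // Explanation:
    //
    // When two session cookies are found, JavaScript coerces the array
    // `cookie_name` to the string "name1,name2". No cookie has that name, so
    // cookie[cookie_name] is undefined, this branch is skipped, and the function
    // returns null, causing the cache key corruption.
    //
    // More interestingly, a one-element array coerces to the cookie name itself,
    // so the following code returns the correct item when only one cookie is found.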
    return cookie[cookie_name]
  }

  return null
}
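
The coercion described in the comments can be observed in isolation. The sketch below is illustrative only: the `ory_session_` prefix, the small cookie parser (a stand-in for the `parse` helper, without URL decoding), and `extractSessionCookies` are assumptions and not the code Ory deployed. It shows the array-as-key behavior and one possible way to return every session cookie explicitly instead of indexing with an array.

```javascript
// Hypothetical prefix; the real value of COOKIE_PREFIX is not shown in this post.
const COOKIE_PREFIX = "ory_session_"

// Minimal cookie header parser, standing in for the `parse` helper used above.
function parseCookieHeader(header) {
  const cookies = {}
  for (const part of header.split(";")) {
    const index = part.indexOf("=")
    if (index === -1) continue
    const name = part.slice(0, index).trim()
    if (name === "") continue
    cookies[name] = part.slice(index + 1).trim()
  }
  return cookies
}

// Returns every session cookie as [name, value] pairs, sorted by name,
// so callers have to handle zero, one, or many credentials explicitly.
function extractSessionCookies(header) {
  return Object.entries(parseCookieHeader(header || ""))
    .filter(([name]) => name.startsWith(COOKIE_PREFIX))
    .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
}

// The coercion behind the bug: an array used as a property key becomes a string.
const one = { ory_session_a: "s1" }
const two = { ory_session_a: "s1", ory_session_b: "s2" }
const lookupOne = one[Object.keys(one)] // key is "ory_session_a", value is "s1"
const lookupTwo = two[Object.keys(two)] // key is "ory_session_a,ory_session_b", value is undefined
```

Returning a list rather than a single value is a design choice: it forces the caller to decide what a request with several credentials means instead of silently treating it as anonymous.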

Impact

Customers of Ory’s clients appeared to be logged in as one of the client’s development team members on individual requests during a window of roughly 3.5 hours.

To be impacted by the issue, the following conditions had to be met:

  1. Multiple Ory projects set up with overlapping cookie domains, e.g. customer.com and staging.customer.com
  2. A user logged in to both domains using cookies and issuing requests to the /sessions/whoami endpoint
  3. Subsequent requests without cookies sent to Ory APIs and hitting the same Cloudflare edge worker as in step (2).

Mitigation

Ory rolled back the change, removing the offending behavior.

Preliminary next steps

Ory is defining process and implementation changes to rule out similar incidents in the future:

  • Move all edge worker code from JavaScript to TypeScript to avoid type-based code issues.
  • Expand test suite to cover cases where clients behave in unexpected ways, incl. sending multiple credentials (cookies, tokens).
  • Introduce a new process to test cache configuration changes extensively before rolling them out to production.
  • Remove the concept of a cached no-session response, returning a statically defined response instead.
  • Respect all possible credentials when computing the cache key, not just the one found first.
  • Significantly expand the test suite to cover possible edge cases as well as malicious clients.
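
Two of the steps above, the statically defined no-session response and a cache key that respects all credentials, could look roughly like the following. This is a sketch under stated assumptions and not Ory's planned implementation: the request shape (`cookies`, `sessionHeader`, `bearerToken`), the `ory_session_` prefix, and all function names are hypothetical. A production key would presumably be hashed rather than contain raw credentials.

```javascript
// Hypothetical prefix and request shape, used for illustration only.
const COOKIE_PREFIX = "ory_session_"

// Statically defined response for requests without credentials.
// It is never read from or written to the cache.
const NO_SESSION_RESPONSE = Object.freeze({ status: 401, user: null })

// Collect every credential on the request, not just the first one found.
function collectCredentials(request) {
  const credentials = []
  for (const [name, value] of Object.entries(request.cookies || {})) {
    if (name.startsWith(COOKIE_PREFIX)) credentials.push(["cookie", name, value])
  }
  if (request.sessionHeader) {
    credentials.push(["header", "X-Session-Cookie", request.sessionHeader])
  }
  if (request.bearerToken) {
    credentials.push(["bearer", "Authorization", request.bearerToken])
  }
  // Sort so that the key does not depend on the order credentials were sent in.
  return credentials.sort((a, b) => {
    const left = a[0] + "|" + a[1]
    const right = b[0] + "|" + b[1]
    return left < right ? -1 : left > right ? 1 : 0
  })
}

// Returns null when there is nothing to key on; such requests bypass the cache.
function computeCacheKey(request) {
  const credentials = collectCredentials(request)
  return credentials.length === 0 ? null : JSON.stringify(credentials)
}

// Serve anonymous requests statically and everything else from the cache.
function respond(cache, request, fetchFromOrigin) {
  const key = computeCacheKey(request)
  if (key === null) return NO_SESSION_RESPONSE
  if (!cache.has(key)) cache.set(key, fetchFromOrigin(request))
  return cache.get(key)
}
```

With this shape, a request carrying two session cookies gets a key derived from both, and a request without credentials can never read an entry that another request wrote.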