Incident Summary
At 10:52 CEST, our development team pushed a flawed update to production, causing requests from integrated assessment platforms to fail. This prevented new exams from being scheduled through integrations, students from starting their exams, and supervisors from opening the Schoolyear Exam Dashboard from integrated platforms.
Two minutes after the release was rolled out, an automated alert notified our operations team of the issue. After the first responder verified the problem, our incident response plan was initiated at 10:59 CEST.
At that time, our support team also began receiving reports from multiple customers, confirming that the issue was widespread.
At 11:12 CEST, the decision was made to roll back the changes. The rollback was completed at 11:15. While we were able to immediately verify that the issue was resolved, we continued to monitor our systems and followed up with the customer who had reported the problem to confirm resolution. Once we were satisfied the incident was fully resolved, we closed it on our status page at 11:31 CEST.
The subsequent investigation traced the root cause to a change in the caching mechanism within our authentication logic. This layer caches authentication information for API keys that are actively making requests to our servers, reducing database load and request latency.
The flawed update introduced a second caching layer. Because this layer was not active in our automated testing framework, the flaw was not caught before it reached production.
Leadup
The day before the incident (11-05), we released a second caching layer for our authentication logic to beta. This change passed our automated testing and peer review, and appeared to be working well, as it had already been active on our internal deployment environments. After a day in beta, it was promoted to production via a rolling update, following our standard operating procedure.
Fault
The caching layer introduced in the update functioned as a secondary cache. Any request authenticated with an API key would first check the server's in-memory cache. Only on a cache miss would it fall through to the secondary layer (Redis). However, the mechanism responsible for writing data to this secondary layer was broken. Any server reading from that cache would receive corrupted data, causing it to incorrectly determine that the API key was invalid.
Impact
Most requests to our server that were authenticated with an API key returned a 401 Unauthorized error. As a result, the Schoolyear integration stopped working for many assessment platforms. These integrations are used to schedule new exams, open the Schoolyear settings widget and exam dashboard, and launch students into their exams.
Throughout the incident, these integrations were highly unreliable or completely unavailable.
Timeline
(All times are in CEST)
- 10:52 – Faulty update pushed to production
- 10:54 – Automated alert notified first responder
- 10:59 – Issue verified; incident response initiated
- 11:12 – Rollback started
- 11:15 – Rollback complete; issue resolved
- 11:31 – Incident closed on status page
Reflection
The root cause investigation revealed a significant gap in our automated test suite coverage. These tests run on every proposed change and serve as a mandatory acceptance criterion, guarding against regressions and unintended breakage of existing features.
Over the past few months, we have been adopting new caching techniques and expanding our test coverage to match. However, it is now clear that the recently introduced 2-layer caching mechanism was not adequately covered by our test suite, which allowed this bug to pass through undetected.
We have already begun work to extend our test suite to cover both layers of our caching infrastructure, simulating production conditions as closely as possible. In the meantime, we have disabled the caching mechanisms that lack sufficient test coverage.