Incident Summary
(All times are in Central European Time)
- A specific bug in our LTI integration caused an exception in our server. The root cause was determined to be a lack of test coverage for LTI-based integrations. This bug should have been caught before it reached production.
- An exception in a single request was able to crash a server. The root cause was determined to be the growing complexity of our exception handling.
Leadup
The day before the incident (21-05), we released to beta a change to our public API, which 3rd parties use to integrate with Schoolyear. We maintain a few integrations ourselves, such as our LTI integration. The update to our LTI integration in this release contained a bug.
After a day in beta, the release was promoted to production via a rolling update, following our standard operating procedure.
Fault
The LTI integration bug caused an exception whenever a student started an exam through our LTI integration. This alone prevented LTI users from starting their exam, but a problem in our exception handling made the impact much larger: the exception crashed the server, which caused our load balancer to move traffic to the remaining servers and our orchestrator to restart the crashed server.
Every time a server crashed, the traffic routed to it failed until the load balancer rerouted the traffic. This caused bursts of failing traffic.
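To illustrate the second problem, the sketch below shows one common way to keep an unhandled exception confined to the request that raised it. It is not our actual exception handling code: the handler signature, the names with_exception_boundary and start_exam_via_lti, and the KeyError are all hypothetical and chosen only for the example.

```python
import logging
from typing import Callable, Dict, Tuple

logger = logging.getLogger("request_boundary")

# Hypothetical types for illustration: a handler takes a request dict
# and returns a (status_code, body) pair.
Request = Dict[str, str]
Response = Tuple[int, str]
Handler = Callable[[Request], Response]


def with_exception_boundary(handler: Handler) -> Handler:
    """Wrap a handler so an unhandled exception fails one request, not the process."""

    def wrapped(request: Request) -> Response:
        try:
            return handler(request)
        except Exception:
            # Log with traceback so the bug still shows up in alerting.
            logger.exception("unhandled exception in %s", request.get("path", "?"))
            return 500, "internal server error"

    return wrapped


def start_exam_via_lti(request: Request) -> Response:
    # Stand-in for a buggy code path; the real bug is not reproduced here.
    raise KeyError("missing field")


def health(request: Request) -> Response:
    return 200, "ok"


safe_start_exam = with_exception_boundary(start_exam_via_lti)
safe_health = with_exception_boundary(health)
```

With a boundary like this, the failing LTI request would return an error to that one student while other requests on the same server continue to be served.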
Impact
Timeline
- 13:24 – Faulty update pushed to production
- 13:26 – First crash triggered
- 13:28 – First alert triggered
- 13:34 – Rollback started, number of failed requests starts reducing
- 13:39 – Rollback completed, full availability restored
Reflection
This is the second incident in one month, and both root cause investigations point to a lack of test coverage. While we are improving our test coverage for LTI and simplifying our exception handling system, we are also looking at our testing and release practices more broadly. Both incidents were triggered by bugs in very specific edge cases that were hard to replicate and sat at the interface between Schoolyear and 3rd party systems.
We are reviewing our testing practices for 3rd party system integrations and developing ways to simulate these edge cases before they are released.
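As a rough sketch of what simulating such edge cases could look like, the example below feeds a set of malformed launch payloads into a handler and checks that each one produces a status code rather than an exception. Everything here is an assumption made for illustration: handle_launch, the required field names, and the payloads do not describe our real LTI integration or the bug from this incident.

```python
from typing import Dict, List, Tuple

# Assumed required fields, for the example only.
REQUIRED_FIELDS = ("user_id", "resource_link_id")


def handle_launch(params: Dict[str, str]) -> Tuple[int, str]:
    """Hypothetical launch handler: rejects bad input instead of raising."""
    missing = [f for f in REQUIRED_FIELDS if not params.get(f, "").strip()]
    if missing:
        return 400, "missing: " + ", ".join(missing)
    return 200, "exam started for " + params["user_id"]


# Payloads a 3rd party system might plausibly send (illustrative).
EDGE_CASES: List[Dict[str, str]] = [
    {},                                                  # nothing at all
    {"user_id": "s1"},                                   # one field missing
    {"user_id": "", "resource_link_id": "exam-1"},       # empty value
    {"user_id": "  ", "resource_link_id": "exam-1"},     # whitespace only
    {"user_id": "s1", "resource_link_id": "exam-1", "unexpected": "x"},
]


def run_edge_cases() -> List[int]:
    """Return the status for each payload; any raised exception fails the run."""
    return [handle_launch(p)[0] for p in EDGE_CASES]
```

Collecting the edge cases in one list makes it cheap to add a new payload each time an integration partner surprises us, so the same case is exercised on every later release.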