Incident Summary:
Last Tuesday, November 4, starting at 9:08 AM UTC, our support team received reports from multiple users about an error message ("exit code 1") when starting the Schoolyear Windows application. A significant portion of students were unable to start Schoolyear on their Windows machines and faced a blank screen with a single error popup.
Exit code 1 is shown by an anti-cheat mechanism in our Windows application. During startup, the application runs a self-check to verify that its files have not been tampered with. As a means of self-protection, it exits as soon as it finds a discrepancy, leading to the "exit code 1" message.
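To illustrate the idea, here is a minimal sketch of such a self-check. It is not our production code: the function name `find_discrepancies`, the manifest format (relative path mapped to a SHA-256 digest), and the file names are assumptions made for this example only. The point it shows is that a file that exists on disk but is not expected counts as a discrepancy, just like a missing or modified file.

```python
import hashlib
from pathlib import Path


def find_discrepancies(install_dir: Path, manifest: dict[str, str]) -> list[str]:
    """Compare the files on disk against a manifest of expected files.

    `manifest` maps relative file paths to expected SHA-256 hex digests.
    (Hypothetical format, chosen for this sketch.)
    """
    problems = []
    on_disk = {
        p.relative_to(install_dir).as_posix()
        for p in install_dir.rglob("*")
        if p.is_file()
    }
    for rel_path, expected in manifest.items():
        if rel_path not in on_disk:
            problems.append(f"missing: {rel_path}")
            continue
        actual = hashlib.sha256((install_dir / rel_path).read_bytes()).hexdigest()
        if actual != expected:
            problems.append(f"modified: {rel_path}")
    # Files that are on disk but not in the manifest are also flagged.
    for rel_path in sorted(on_disk - set(manifest)):
        problems.append(f"unexpected: {rel_path}")
    return problems
```

In a check like this, a single leftover file from an earlier version is enough to produce a non-empty result and stop the application from starting.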
At 9:18 AM, the Schoolyear development team performed a version rollback of the Windows application, reverting the automatic update that was released that night (v3.17.5). The rollback was instant, meaning that when a student started the application at that point, the previous version was started. This resolved the issue and allowed students to start their exam. However, it did mean that the changes in the automatic update from that night were not in effect.
Once the immediate issue was resolved, the Schoolyear development team investigated the root cause and released a new version (v3.17.6), which included the intended changes of the automatic update and a modified version of the self-check. This version was rolled out throughout the remainder of the day.
Leadup:
On the previous evening, an update was scheduled for all customers. This update had been tested extensively, both internally and on beta, where it had been in use for more than a week without problems.
This update was a major update, meaning that it required a full replacement of its installation directory.
Fault:
As opposed to the major update of that evening, the update released the week before (Oct 28, from v3.15.6 to v3.15.11) was a patch update, meaning that only a small portion of the installation directory had to be updated.
Patch updates only require a ~20MB download, while major updates require a ~180MB download. To limit the network impact our application has, we try to use patch updates as much as possible.
The patch update v3.15.11 added a new file to the installation directory, which in itself is a rare event. Even rarer, this file was immediately removed again in the next update (v3.17.5). Before the release, the Schoolyear development team tested whether this file was indeed removed when updating from the previous version.
Due to the way our software performs patch and major updates, this file was not removed when a student installed 3.15.6, then updated to 3.15.11 and then updated to 3.17.5. This particular sequence of updates was not tested by the development team.
On the morning of November 4, all Windows users in this particular sequence auto-updated to 3.17.5 before they started their exam. Upon start, the self-check detected the unexpected file and the application closed with "exit code 1".
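The difference between the tested and the untested update sequence can be shown with a toy model. This is a simplified sketch, not our actual updater: it assumes that a major update only deletes files it knows about from the last full install, that a patch update adds files without recording them, and that the tested path started from a fresh install of 3.15.11. The file names are made up.

```python
from dataclasses import dataclass, field


@dataclass
class Install:
    """Toy model of an installation directory (hypothetical)."""
    files: set[str]
    # Assumption: only files written by a full install or major update are tracked.
    tracked: set[str] = field(default_factory=set)


def fresh_install(release_files: set[str]) -> Install:
    return Install(files=set(release_files), tracked=set(release_files))


def patch_update(install: Install, added: set[str]) -> None:
    # Assumption: a patch writes new files without recording them as tracked.
    install.files |= added


def major_update(install: Install, release_files: set[str]) -> None:
    # Assumption: a major update deletes only tracked files, then writes the new set.
    install.files -= install.tracked
    install.files |= release_files
    install.tracked = set(release_files)


V3_15_6 = {"app.exe", "core.dll"}
V3_15_11 = V3_15_6 | {"extra.dll"}  # the patch added one file
V3_17_5 = {"app.exe", "core.dll"}   # the major update removed it again

# Tested path (assumed): fresh install of 3.15.11, then the major update.
tested = fresh_install(V3_15_11)
major_update(tested, V3_17_5)

# Untested path: install 3.15.6, patch to 3.15.11, then the major update.
untested = fresh_install(V3_15_6)
patch_update(untested, {"extra.dll"})
major_update(untested, V3_17_5)

print(sorted(tested.files - V3_17_5))    # []
print(sorted(untested.files - V3_17_5))  # ['extra.dll']
```

Under these assumptions both paths end on the same version, but only the second one leaves a file behind that the self-check does not expect.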
Impact:
973 students were faced with the "exit code 1" message and were unable to start their exam unless they reinstalled the application manually. Students without admin permissions (e.g., on managed devices) had to wait for the version rollback before they could restart their exam.
Response:
Within 10 minutes of the first issue being reported, Schoolyear reverted to the previous version, after which all impacted students were able to restart their exam.
Having found the root cause of the failure, the Schoolyear team is now developing and testing a mechanism to prevent this issue from happening again. Until this mechanism is released, we perform manual tests before each release to cover this edge case.
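As a rough illustration of what covering this edge case involves, the sketch below lists every sequence of updates that can lead to a given release. It is not the mechanism we are building: the helper `upgrade_paths` is hypothetical, and it assumes a device may have skipped any intermediate release.

```python
from itertools import combinations


def upgrade_paths(releases: list[str], target: str) -> list[tuple[str, ...]]:
    """List every ordered path through `releases` that ends at `target`.

    A path starts at any earlier release (the version first installed) and
    may pass through any subset of the releases in between.
    """
    end = releases.index(target)
    earlier = releases[:end]
    paths = []
    for start in range(len(earlier)):
        between = earlier[start + 1:]
        for size in range(len(between) + 1):
            for stops in combinations(between, size):
                paths.append((earlier[start], *stops, target))
    return paths


for path in upgrade_paths(["3.15.6", "3.15.11", "3.17.5"], "3.17.5"):
    print(" -> ".join(path))
```

For the three versions involved in this incident, this yields three paths, including the one that was not tested. The number of paths doubles with every earlier release that is included, so in practice such a list would have to be limited to recent releases.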
Timeline:
- 00:00 - Major update is automatically rolled out to production.
- 09:08 - First user report of problems.
- 09:18 - Faulty update is rolled back.
- 10:10 - We confirm the issue is resolved for all users.
Reflection:
While we perform extensive internal testing on many different hardware and software configurations before releasing an update to production, tests can only cover so many scenarios. That is why updates are rolled out in stages. Most updates even run in a sandboxed portion of our application, where their impact is limited.
Realistically, the only thing that could make this kind of release fail was an edge case in the update mechanism that was not triggered during the staged rollout. Unfortunately, that is exactly what happened.
Looking back at this incident, we are grateful to have invested in the extensive monitoring systems that allowed us to pinpoint the faulty version within 30 seconds, and in the rapid rollback mechanism that allowed us to roll it back within 10 minutes.