Incident Summary:
Last Friday, January 12, starting at 8:47 AM, our support team received reports from several users about an error message in the Schoolyear onboarding app when updating and launching. A brief analysis by the technical team showed that an incorrect version number was preventing students from starting Schoolyear. The incident was classified as impact level 1 (high) and required immediate attention. At 9:01 AM, 14 minutes after the first report, a fix was deployed to production and the problem was resolved immediately.
Leadup:
At midnight on the night of January 11 to 12, a maintenance update was automatically rolled out to production. These updates are released to production several times a week and do not require any action from students.
Fault:
With each update, including maintenance updates, the version number of the Schoolyear client changes. Although the rollout of updates is fully automated, the version number is set by hand, and one of our engineers made an error in determining the correct version number while planning this update.
Impact:
Because of the incorrect version number, students received an incorrect, non-functional download link when starting or updating Schoolyear on the morning of January 12, and encountered an error message. This error had the highest impact (impact 1 / urgency 1) because no student could start their tests.
Response:
Following the first report at 8:47 AM, the on-call engineer started an investigation and had identified the error by 8:55 AM. A fix was then prepared and deployed to production at 9:01 AM.
Timeline:
00:00 - Maintenance update is automatically rolled out to production.
08:47 - First user report of installation problems.
08:55 - On-call engineer traces the problem to the incorrect version number.
09:01 - Fix is deployed to production and all issues are resolved.
Reflection:
Humans are imperfect, which is why we automate update rollouts as much as possible. Ultimately, though, a human must decide if and when an update is released, which is done by configuring a version number and a timestamp. Until this incident, we had no automated test for that version number. To prevent such errors in the future, we immediately modified the planning system to add this automated test, reducing the impact of human error on update operations.
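As an illustration only, an automated check on a planned release might look like the Python sketch below. It is not the actual Schoolyear planning system: the names PlannedRelease and validate_planned_release, the three-part version format, and the individual rules (well-formed version, newer than the current one, a built client exists for it, timestamp in the future) are all assumptions made for this example.

```python
import re
from dataclasses import dataclass
from datetime import datetime

# Assumption: versions look like "MAJOR.MINOR.PATCH". The real format is not
# stated in this report.
VERSION_PATTERN = re.compile(r"(\d+)\.(\d+)\.(\d+)")


@dataclass(frozen=True)
class PlannedRelease:
    """Hypothetical record of what a human configures: a version and a time."""
    version: str
    release_at: datetime


def parse_version(version: str) -> tuple[int, ...]:
    """Turn "1.4.2" into (1, 4, 2); raise ValueError if it is malformed."""
    match = VERSION_PATTERN.fullmatch(version)
    if match is None:
        raise ValueError(f"malformed version number: {version!r}")
    return tuple(int(part) for part in match.groups())


def validate_planned_release(
    planned: PlannedRelease,
    current_version: str,
    built_versions: set[str],
    now: datetime,
) -> list[str]:
    """Return a list of problems; an empty list means the plan passes."""
    try:
        planned_parts = parse_version(planned.version)
    except ValueError as exc:
        # Nothing else can be checked against a malformed version.
        return [str(exc)]

    problems = []
    if planned_parts <= parse_version(current_version):
        problems.append(
            f"version {planned.version} is not newer than {current_version}"
        )
    if planned.version not in built_versions:
        # A version without a built client would yield a dead download link.
        problems.append(f"no built client exists for version {planned.version}")
    if planned.release_at <= now:
        problems.append("release timestamp is not in the future")
    return problems
```

In a setup like this, the planning system would refuse to schedule any release for which the returned list is not empty, so a mistyped version number is caught when the update is planned rather than when students start the app.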