Incident summary
Between 10:17:25 and 10:52:40 CEST, the Schoolyear Secure Browser platform experienced a critical server outage. Students were unable to start their exams, and on some exam platforms students were unable to continue their exams or hand in their submissions.
This incident was caused by a manual database operation that was executed on the wrong database server, overwhelming that server.
Seconds after the incident started, the operational team was made aware of the problem and began resolving the incident as follows:
- Initial investigation: 10:18
- Presumed root cause resolved: 10:19
- Recovery work begins: 10:25
- Communication to customers: 10:27
- Database rebuild: 10:42-10:52
This incident was handled with our highest priority (impact 1) as it affected the functionality of our platform for all users.
Lead-up
At 10:15 CEST one of our engineers executed an analytical database query on one of our servers. This query was executed to gather usage statistics for routine operational work. Such analytical queries are performed ad hoc and are not optimised for performance. An analytical query may take anywhere from 20 seconds to more than 30 minutes, during which a database server may be under a lot of stress (>70% utilisation).
To not impact our normal production traffic, such analytical queries are commonly executed on replica databases: servers that synchronise with the main database. These replicas may go down at any moment without impacting the service, making them well-suited for ad-hoc operational work.
Access to such replica servers is generally available within the Schoolyear team to those who require it for their day-to-day work. Access to the main database, however, is restricted to a handful of engineers responsible for database maintenance: those who cannot perform their duties without such access. At 10:15 CEST, one of these engineers mistakenly used his connection to the main database to perform an analytical query, instead of using the connection to a replica server.
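For illustration, the replica-versus-primary split can also be enforced in tooling rather than left to which connection an engineer happens to pick. The sketch below shows one way to do this in Go; it assumes a PostgreSQL-compatible setup and a hypothetical REPLICA_DSN configuration key, and is not a description of our actual maintenance tooling.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"
	"os"

	_ "github.com/lib/pq" // assumes a PostgreSQL-compatible setup; the real engine and driver may differ
)

// openAnalyticsDB returns a database handle meant for ad-hoc analytical work only.
// It never falls back to the primary: if no replica DSN is configured, it returns
// an error instead of silently connecting to the main database.
func openAnalyticsDB() (*sql.DB, error) {
	dsn := os.Getenv("REPLICA_DSN") // hypothetical configuration key, not our real one
	if dsn == "" {
		return nil, fmt.Errorf("no replica configured: refusing to run analytics against the primary")
	}
	return sql.Open("postgres", dsn)
}

func main() {
	db, err := openAnalyticsDB()
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	log.Println("opened a handle to the replica for analytical work")
}
```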
Fault
After two minutes of ever-increasing resource consumption by the misplaced analytical query, the main database was unable to keep up with normal traffic, and at 10:17:25 CEST normal operations started to time out. Queries that were supposed to take <10ms were taking >2s.
At 10:18:05 CEST the mistake was caught and the offending analytical query was stopped. However, in the 40 seconds between the first timeout and the cancellation, the queue of pending requests grew to roughly 5,000. After the cancellation of the offending query, the system was not able to recover for two compounding reasons:
- The queue kept growing at a rate of ~150 req/s, as expected during the ongoing examination season.
- Any database connection that timed out needed to be reset before it could be used again, which put extra pressure on an already overwhelmed database.
During recovery, both automated systems and, subsequently, the engineering team tried to reduce load on the database by throttling incoming traffic. However, this only increased the number of retries fired at our backend.
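The client-side fix mentioned under Reflection goes in this direction: retries need to back off and spread out rather than fire again immediately. The sketch below illustrates the general technique (exponential backoff with jitter) in Go; the function, limits and URL are hypothetical and not the actual change shipped in the Schoolyear client.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"net/http"
	"time"
)

// getWithBackoff retries a GET request with exponential backoff and jitter, so
// that many clients facing the same outage spread their retries out instead of
// hammering the backend in lockstep. The limits and URL used here are illustrative.
func getWithBackoff(client *http.Client, url string, maxAttempts int) (*http.Response, error) {
	base := 500 * time.Millisecond
	for attempt := 0; attempt < maxAttempts; attempt++ {
		resp, err := client.Get(url)
		if err == nil && resp.StatusCode < 500 {
			return resp, nil
		}
		if resp != nil {
			resp.Body.Close()
		}
		// Exponential backoff capped at 30s, plus random jitter up to the full delay.
		delay := base << attempt
		if delay > 30*time.Second {
			delay = 30 * time.Second
		}
		time.Sleep(delay + time.Duration(rand.Int63n(int64(delay))))
	}
	return nil, errors.New("backend still unavailable after retries")
}

func main() {
	resp, err := getWithBackoff(http.DefaultClient, "https://example.com/health", 5)
	if err != nil {
		fmt.Println(err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```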
Impact
Between 10:18 and 10:52 CEST, the Schoolyear Secure Browser backend was mostly unavailable, and it was completely unavailable between 10:42 and 10:52 CEST. Students were unable to start their exams, teachers were unable to plan or edit exams, and test platforms were unable to verify whether students were using the Schoolyear browser. This resulted in some students being unable to hand in their exam.
Detection
The incident was detected seconds after the first requests started failing and engineers sprang into action 30 seconds later. Our logging systems did not fail us.
Response
After the immediate response of the engineering team, we began to inform our customers of the ongoing incident. We updated our status page and sent out an email to our emergency contacts.
While informing customers, we discovered a bug in the third-party status page tool we use to communicate with our customers (OneUptime): during the outage, the status page did not load. This tool had been validated to share no dependencies with our infrastructure, specifically to make sure it would remain available during an incident at Schoolyear.
Even though we thoroughly tested this tool, it failed in its moment of truth. We no longer trust this tool to do its job and it will be replaced promptly.
Recovery timeline
After multiple attempts to recover the service failed due to the so-called "Thundering Herd Problem", the engineering team decided to do a full rebuild of the database infrastructure from a point-in-time backup:
- 10:41 CEST: All traffic was drained from the backend. Requests were dropped at the edge of our infrastructure and we started responding only with HTTP 503 (a sketch of this mechanism follows the timeline).
- 10:42 CEST: Shutdown of the database servers.
- 10:42 CEST: Rebuild of new database servers from point-in-time backups begins.
- 10:51 CEST: Database rebuild completed.
- 10:52 CEST: Incoming traffic is allowed back into the backend. Due to cold caches in the database, our average latency is 2.3s: far above the targeted 10ms, but low enough for students to resume their exams.
- 11:01 CEST: Last timeout warning is recorded.
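For context on the 10:41 step: draining traffic at the edge means rejecting every request with HTTP 503 before it can add load to the backend or the recovering database. Below is a minimal Go sketch of that idea; it assumes a simple HTTP middleware with a hypothetical drainMode flag and is not how our edge is actually implemented.

```go
package main

import (
	"net/http"
	"sync/atomic"
)

// drainMode is switched on while the database is being rebuilt; every request is
// then answered at the edge with a 503 and a Retry-After hint, so no load reaches
// the backend or the database. The flag and handler are illustrative.
var drainMode atomic.Bool

func drainMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if drainMode.Load() {
			w.Header().Set("Retry-After", "60")
			http.Error(w, "service temporarily unavailable", http.StatusServiceUnavailable)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	drainMode.Store(true) // simulate the 10:41 drain: reject everything at the edge

	backend := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})
	http.ListenAndServe(":8080", drainMiddleware(backend))
}
```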
Reflection
We've identified multiple compounding failures that caused the incident, and have therefore taken the following actions:
| Failure | Immediate action | Follow-up action |
| --- | --- | --- |
| Analytical query overwhelms main database | Colour coding in the maintenance tool to distinguish between the main database and the replica. | Configure resource consumption limit for maintenance engineers on the main database. |
| Main database unable to recover | Increase database capacity. | Evaluation & training of recovery plans in case of a Thundering Herd Problem and building of a simulation tool for future automated testing. |
| Clients cause Thundering Herd traffic during outage | Client update with a specific fix. | Update code review policy. |
| Status page unavailable during outage | Report bug to third party. | Replace status page provider. |