Incident summary
Between 10:12 and 10:36 CEST, the Schoolyear Secure Browser platform experienced a server outage. Most ongoing exams continued uninterrupted, as designed. However, no new exams could be created, edited, or started.
This incident was caused by a misconfiguration of our auto-scaling system. The incident was detected by our monitoring system, and the engineer on call began resolving it as follows:
- Starting the investigation of the root cause at 10:14
- Restarting all application servers at 10:14 and again at 10:35-10:36
- Performing a full database fail-over at 10:15
This incident was handled with our highest priority (impact 1) as it affected the functionality of our platform for all users.
Leadup
On the 26th of June 2023, two minutes before the incident, an unusually high number of ping requests led our auto-scaling configuration to dramatically increase the number of server instances. This was then followed by a second wave of requests that required a database connection.
This sudden surge led to an excessive demand for database connections, causing the connection pool to fail and the platform to become unavailable for our users.
Fault
The outage was caused by our auto-scaling configuration. A large surge of ping requests, followed by a surge of requests requiring a database connection, flooded the connection pool with new connections. The pool failed, making the platform unavailable.
Impact
Between 10:12 and 10:36, no new exams could be created, edited, or started on the Schoolyear Secure Browser platform. Furthermore, users were unable to log in, check student logs in the dashboard, or restart individual students, and test applications were unable to verify whether a student was using the Secure Browser.
However, this incident had minimal impact on students who had already started their exams, in line with the platform's design to ensure ongoing exams continue uninterrupted during such events.
Detection
The incident was detected when our monitoring system reported an elevated failure level and our support staff received several calls and emails from users unable to log in. Upon investigation, the engineer on call traced the failures to the "connection pool" between the application servers and the database cluster.
Response
After receiving the alert at 10:14, the engineer on call immediately started the investigation. The decision was made to restart all application servers and perform a full database fail-over. The application servers were first shut down to prevent them from initiating connections to the database pool and overwhelming it with requests.
After the database fail-over was completed at 10:34, the application servers were restarted. However, due to the backlog of events, our servers were once again overwhelmed. We decided to restart all application servers a second time, but at a more gradual pace, increasing our capacity over a span of two minutes.
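To illustrate the gradual second restart, here is a minimal sketch of the idea: bring application servers back in small batches with a pause between batches, so the database connection pool is not hit by every server at once. The fleet size, batch size, and startServer helper are illustrative assumptions, not our actual tooling.

```go
package main

import (
	"fmt"
	"time"
)

// startServer is a placeholder for starting one application server instance.
func startServer(id int) {
	fmt.Printf("starting application server %d\n", id)
}

func main() {
	const totalServers = 40       // illustrative fleet size
	const batchSize = 5           // servers started per step
	const step = 15 * time.Second // spreads the ramp-up over roughly two minutes

	for started := 0; started < totalServers; started += batchSize {
		for id := started; id < started+batchSize && id < totalServers; id++ {
			startServer(id)
		}
		time.Sleep(step) // let connections settle before the next batch
	}
}
```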
Recovery
After adjusting the auto-scaling configuration and allowing our capacity to gradually increase, the service was restored and everything returned to normal by 10:36.
Timeline
All times are CEST, 26th of June 2023.
10:12 - Requests to our servers start failing.
10:14 - Our monitoring system reports the elevated failure level to the engineer on call.
10:14 - Engineer on call traces the failures to the "connection pool" and decides to restart all application servers.
10:15 - Decision made to perform a full database fail-over and shut down all application servers.
10:34 - The database fail-over is completed and application servers are restarted.
10:35-10:36 - Decision made to restart all application servers again, but at a more gradual pace.
10:36 - Everything back to normal.
Reflection
The outage was caused by our auto-scaling configuration. Most incoming requests, e.g. ping requests that check internet connectivity, do not require a database connection. However, a large surge of such requests does increase the number of application servers that are spun up.
Yesterday morning, we had an unusually high number of such ping requests, directly followed by an unusually high number of requests that did require a database connection. The first wave of ping requests caused our auto-scaling to increase the number of servers dramatically. The second wave of requests was therefore spread out over many servers (~1.5k) that all needed a separate database connection, causing a spike in new connections to the connection pool and, ultimately, its failure.
Moving forward, we've changed our auto-scaler to limit the number of servers it creates during traffic spikes. This allows us to handle larger traffic surges by degrading our service gradually rather than causing a general outage, and gives our development team time to respond before the issue becomes widespread.
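The sketch below shows the idea behind this change: the desired instance count is still driven by request volume, but it is clamped to a hard ceiling so a burst of cheap ping requests can no longer grow the fleet without bound. The function name, per-instance capacity, and limits are illustrative assumptions, not our production configuration.

```go
package main

import "fmt"

const (
	requestsPerInstance = 200 // assumed requests one instance can absorb
	minInstances        = 4   // baseline capacity
	maxInstances        = 100 // hard cap added after the incident
)

// desiredInstances turns an observed request rate into an instance count,
// clamped between the baseline and the new hard ceiling.
func desiredInstances(requestRate int) int {
	n := requestRate / requestsPerInstance
	if n < minInstances {
		return minInstances
	}
	if n > maxInstances {
		return maxInstances // degrade gradually instead of scaling without limit
	}
	return n
}

func main() {
	for _, rate := range []int{500, 50_000, 300_000} {
		fmt.Printf("rate=%d -> instances=%d\n", rate, desiredInstances(rate))
	}
}
```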
As part of the solution, when there are more incoming requests than available database connections, the available connections are now shared among the application servers rather than each server opening its own. These changes will improve our ability to handle similar incidents in the future.
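As a rough illustration of bounding per-server connection use, the sketch below uses Go's standard database/sql pool: with SetMaxOpenConns, additional requests wait for a free connection instead of opening a new one, keeping total demand on the central connection pool predictable. The Postgres driver, DSN, and limits are placeholders and assumptions, not our exact setup.

```go
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq" // example Postgres driver; any database/sql driver works the same way
)

func main() {
	db, err := sql.Open("postgres", "postgres://user:pass@db.internal/exams")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	db.SetMaxOpenConns(10)                 // each server holds at most 10 connections
	db.SetMaxIdleConns(5)                  // keep a few warm connections around
	db.SetConnMaxLifetime(5 * time.Minute) // recycle connections periodically
}
```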
Finally, we plan to further improve our stress testing of unusual traffic scenarios like this one. This will help us spot potential bottlenecks beforehand and put measures in place to stop similar issues from happening in the future.
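For instance, a stress test for this incident's traffic pattern could fire a burst of cheap ping requests immediately followed by a burst of database-backed requests, as in the minimal sketch below. The target URL, endpoints, and request counts are assumptions for demonstration only; real tests would run against a staging environment.

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
)

// burst sends n concurrent GET requests to url and waits for all of them.
func burst(url string, n int) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			resp, err := http.Get(url)
			if err != nil {
				return // count errors in a real test; ignored here for brevity
			}
			resp.Body.Close()
		}()
	}
	wg.Wait()
}

func main() {
	base := "https://staging.example.invalid" // placeholder target
	fmt.Println("phase 1: ping burst")
	burst(base+"/ping", 5000) // cheap requests that trigger scaling
	fmt.Println("phase 2: database-backed burst")
	burst(base+"/api/exams", 5000) // requests that each need a database connection
}
```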
Communication
We will also improve our communication procedures during incidents of this nature. Our support staff received several calls and emails during the outage. In such situations, in addition to the status-update email list, we will create a public online status page where all users are informed simultaneously.
If the support team is unable to respond to the number of incoming calls during a full outage, an automated message on our support line will acknowledge the issue and refer callers to our status emails and status page.