Postmortem Incident Report

Summary

<aside> <img src="/icons/document_gray.svg" alt="/icons/document_gray.svg" width="40px" /> A high-level overview of the incident

</aside>

On 03 January 2023, our jobs failed for about 3 hours, which left all of the daily dashboards with stale data.

Our processes started failing because of a faulty Pull Request (add more details here) that was deployed on the same day.

The first job after the deployed change failed, triggering a notification in Slack. The team acknowledged the problem and started working on it immediately.

Timeline

<aside> <img src="/icons/calendar_gray.svg" alt="/icons/calendar_gray.svg" width="40px" /> All timestamps in GMT+2

</aside>

Date / Time	Event
@January 3, 2023 12:00 PM (GMT+2)	We merged this Pull Request. The change was supposed to add a new feature to our pipelines, but it seems it broke something else. This is the start of the outage.
@January 3, 2023 12:43 PM (GMT+2)	The first job after the release started. It failed immediately and triggered a Slack notification.
@January 3, 2023 1:00 PM (GMT+2)	Bob acknowledged the incident and started working on it.
@January 3, 2023 3:14 PM (GMT+2)	Bob started a Zoom call with Jenni and paired with her on the incident resolution.
@January 3, 2023 3:45 PM (GMT+2)	Bob and Jenni deployed a fix for the broken PR and restarted the failed jobs.
@January 3, 2023 2:38 PM (GMT+2)	The restarted jobs finished running. This is the end of the outage.

Takeaways

<aside> <img src="/icons/light-bulb_gray.svg" alt="/icons/light-bulb_gray.svg" width="40px" /> Lessons learned and plans to improve the situation.

</aside>

Monitoring

Our Slack notifications worked very well. They helped us discover the problem before the rest of the company noticed it.

We should look for opportunities to add monitoring to more parts of the project so that we get notified early more often.
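As a starting point for extending this kind of alerting to other jobs, here is a minimal sketch of a wrapper that sends a Slack message whenever a job raises an exception. It is not our current notification setup. The function names, the `SLACK_WEBHOOK_URL` environment variable, and the message wording are all assumptions made for illustration; the only external behavior relied on is that a Slack incoming webhook accepts a JSON body with a `text` field.

```python
import json
import os
import urllib.request
from typing import Any, Callable


def build_failure_message(job_name: str, error: Exception) -> dict:
    """Build the JSON payload for a Slack incoming webhook."""
    return {"text": f"Job '{job_name}' failed: {type(error).__name__}: {error}"}


def post_to_slack(payload: dict) -> None:
    """POST the payload to the webhook URL.

    Assumption: the URL is stored in the SLACK_WEBHOOK_URL environment variable.
    """
    request = urllib.request.Request(
        os.environ["SLACK_WEBHOOK_URL"],
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request, timeout=10)


def run_with_alert(
    job_name: str,
    job: Callable[[], Any],
    notify: Callable[[dict], None] = post_to_slack,
) -> Any:
    """Run a job; on any exception, send a notification and re-raise."""
    try:
        return job()
    except Exception as error:
        notify(build_failure_message(job_name, error))
        # Re-raise so the scheduler still records the job as failed.
        raise
```

Passing `notify` as a parameter keeps the wrapper testable without network access and makes it easy to route alerts to a different channel later.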

Pull Requests Process

No one understood the impact of the faulty Pull Request. It looks like we tend to approve changes in the code without understanding them deeply.

We can do a few things to improve the situation here:

First, we'll have to ensure we no longer open overly large PRs unless there's no other way. When someone opens a Pull Request that's too large, we should look for opportunities to split it into multiple smaller pieces.

Second, we must use Pull Requests as an opportunity to communicate better. We must start writing better descriptions when opening Pull Requests and ask more questions when reviewing them. We can even create checklists for those two processes.
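The first point, keeping Pull Requests small, could be backed by an automated check. The sketch below sums the added and deleted lines reported by `git diff --numstat` and flags a change that exceeds a limit. The 400-line limit and the function names are assumptions for illustration, not an agreed team policy.

```python
# Assumed threshold; the team would need to agree on an actual value.
MAX_CHANGED_LINES = 400


def count_changed_lines(numstat_output: str) -> int:
    """Sum added and deleted lines from `git diff --numstat` output.

    Each line has the form "<added>\t<deleted>\t<path>".
    """
    total = 0
    for line in numstat_output.splitlines():
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        added, deleted = parts[0], parts[1]
        # Binary files are reported with "-" instead of line counts.
        if added == "-" or deleted == "-":
            continue
        total += int(added) + int(deleted)
    return total


def is_too_large(numstat_output: str, limit: int = MAX_CHANGED_LINES) -> bool:
    """Return True when the change touches more lines than the limit."""
    return count_changed_lines(numstat_output) > limit
```

In a CI job, the input could come from something like `git diff --numstat main...HEAD`, assuming the base branch is named `main`. A line count is only a rough proxy for review difficulty, so a check like this is better used as a prompt to consider splitting the change than as a hard gate.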