The VOID

Courtney Nash

The Verica Open Incident Database (VOID) makes public software-related incident reports available to everyone, raising awareness and increasing understanding of software-based failures in order to make the internet a more resilient and safe place. This podcast is an insider's look at software-related incident reports. Each episode, we pull an incident report from the VOID (https://www.thevoid.community/) and invite the author(s) on to discuss their experience both with the incident itself and with the process of analyzing and writing it up for others to learn from.
Technology

Episodes

Episode 6: Laura Nolan and Control Pain
Apr 25 2023
In the second episode of the VOID podcast, Courtney Wang, an SRE at Reddit, said that he was inspired to start writing more in-depth narrative incident reports after reading the write-up of the Slack January 4th, 2021 outage. That incident report, along with many other excellent ones, was penned by Laura Nolan, and I've been trying to get her on this podcast since I started it. So, this is a very exciting episode for me. And for you all, it's going to be a bit different because instead of just discussing a single incident that Laura has written about, we get to lean on and learn from her accumulated knowledge doing this for quite a few organizations. And she's come with opinions.

A fun fact about this episode: I was going to title it "Laura Nolan and Control Plane Incidents," but the automated transcription service that I use, which is typically pretty spot on (thanks, Descript!), kept changing "plane" to "pain" and, well, you're about to find out just how ironic that actually is...

We discussed:
- A set of incidents she's been involved with that featured some form of control plane or automation as a contributing factor to the incident.
- What we can learn from fields of study like Resilience Engineering, such as the notion of Joint Cognitive Systems
- Other notable incidents that have similar factors
- Ways that we can better factor in human-computer collaboration in tooling to help make our lives easier when it comes to handling incidents

References:
- Slack's Outage on Jan 4th 2021
- A Terrible, Horrible, No-Good, Very Bad Day at Slack
- Google's "satpocalypse"
- Meta (Facebook) outage
- Reddit Pi-day outage
- Ironies of Automation (Lisanne Bainbridge)
Episode 2: Reddit and the Gamestop Shenanigans
Dec 1 2021
At the end of January 2021, a group of Reddit users organized what's called a "short squeeze." They intended to wreak havoc on hedge funds that were shorting the stock of a struggling brick-and-mortar game retailer called GameStop. They were coordinating to buy more stock in the company and drive its price further up. In large part, they were successful, at least for a little while. One hedge fund lost somewhere around $2 billion and one Reddit user purportedly made off with around $13 million. Things managed to get even weirder from there, when online trading company Robinhood restricted trading for GameStop shares and sent the stock plummeting; it lost three-fourths of its value in just over an hour. But that's less relevant to this episode. What matters is that while all this was happening, traffic to a very specific page on Reddit, a subreddit called r/wallstreetbets, went to the moon. Long after the dust had settled, and the team had a chance to recover and reflect, some of the engineers wrote up an anthology of reports based on the numerous incidents they had that week.
We talk to Courtney Wang, Garrett Hoffman, and Fran Garcia about those incidents, and their write-ups, in this episode.

A few of the things we discussed include:
- The precarious dynamic where business successes (traffic surges based on cultural whims) are hard to predict, and can hit their systems in wild and surprising ways.
- How incidents like these have multiple contributing factors, not all of which are purely technical
- How much they learned about their company's processes, assumptions, organizational boundaries, and other "non-technical" factors
- How people are the source of resilience in these complex sociotechnical systems
- Creating psychologically safe environments for people who respond to incidents
- Their motivation for investing so much time and energy into analyzing, writing, and publishing these incident reviews
- What studying near misses illuminated for them about how their systems work

Resources mentioned in this episode include:
- Reddit's r/wallstreetbets incident anthology, which links to all the reports we discuss.
- "Work as imagined and work as done" by Steven Shorrock (video)