Episode 1: Honeycomb and the Kafka Migration


Nov 1 2021 • 31 mins

"We no longer felt confident about what the exact operational boundaries of our cluster were supposed to be."

In early 2021, observability company Honeycomb experienced a series of outages related to a migration of its Kafka architecture, culminating in a 12-hour incident, an extremely long outage for the company. In this episode, we chat with two engineers involved in these incidents, Liz Fong-Jones and Fred Hebert, about the backstory summarized in the meta-analysis they published in May.

We cover a wide range of topics beyond the specific technical details of the incident (which we also discuss), including:

  • Complex socio-technical systems and the kinds of failures that can happen in them (they're always surprises)
  • Transparency and the benefits of companies sharing these outage reports
  • Safety margins, performance envelopes, and the role of expertise in developing a sense for them
  • Honeycomb's incident response philosophy and process
  • The cognitive costs of responding to incidents
  • What we can (and can't) learn from incident reports

Resources mentioned in the episode:

Published in partnership with Indeed.