AF - 200 COP in MI: Techniques, Tooling and Automation by Neel Nanda

The Nonlinear Library

Jan 6 2023 • 24 mins

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: 200 COP in MI: Techniques, Tooling and Automation, published by Neel Nanda on January 6, 2023 on The AI Alignment Forum.

This is the seventh post in a sequence called 200 Concrete Open Problems in Mechanistic Interpretability. Start here, then read in any order. If you want to learn the basics before you think about open problems, check out my post on getting started. Look up jargon in my Mechanistic Interpretability Explainer. I'll make another post every 1-2 days, giving a new category of open problems. If you want to read ahead, check out the draft sequence here!

Motivating papers: Causal Scrubbing, Logit Lens

Motivation

In Mechanistic Interpretability, the core goal is to form true beliefs about what's going on inside a network. The search space of possible circuits is extremely large, and even once a circuit is found, we need to verify that it's what's really going on. These are hard problems, and good techniques and tooling are essential to making progress. This is particularly important in mech interp because the field is so young that there isn't an established toolkit and standard of evidence, and each paper seems to use somewhat different, ad-hoc techniques (pre-paradigmatic, in Thomas Kuhn's language).

Getting better at this matters for getting traction on interpreting circuits at all, even in a one-layer toy language model! But it's particularly important for dealing with the problem of scale. Mech interp can be very labour intensive, involving a lot of creativity and well-honed research intuitions. This isn't the end of the world with small models, but ultimately we want to understand models with hundreds of billions to trillions of parameters! We want to leverage researcher time as much as possible, and the holy grail is to eventually automate finding circuits and understanding models. My guess is that the most realistic path to really understanding superhuman systems is to slowly automate more and more of the work with weaker systems, while making sure that we understand those systems and that they're aligned with what we want.

This can be somewhat abstract, so here are my best guesses for what progress could look like:

- Refining understanding: In what contexts are current techniques most useful, and where are they misleading? How can we notice misapplications? Which techniques fail in the same way? E.g., what's up with backup heads, which take over when the main head is ablated? (See the ablation sketch after this list.)
- How to find circuits: What are the right mindsets and approaches for finding a novel circuit? What are the common traps? What tools are best to apply in an unfamiliar situation, and how should their output be interpreted? I expect this to look like a mix of a refined understanding of current techniques, building great infrastructure, and building practical intuitions and experience.
- Building a better toolkit: What kind of new, general techniques are out there for really understanding model internals? One of the joys of mech interp is that we have full control over the model's internals, and can edit whatever weights and activations we want. I expect there are many ways to use this that no one's tried! Causal tracing/activation patching from ROME is a great example of this - it's such an elegant, powerful and generally applicable technique that I'd just never thought of before. (See the patching sketch after this list.)
- Gold standards of evidence: What does it mean to have truly understood a circuit? Are there generally applicable approaches? Redwood's Causal Scrubbing is a solid attempt here, and I'm excited to see how well it works in practice.
- Good infrastructure: The right software and tooling can massively increase the rate of research - if common operations can be done in a few lines of code and run quickly, you can focus on actually doing research. My TransformerLens li...
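To make the ablation in the "backup heads" example concrete, here is a minimal sketch of zero-ablating a single attention head and checking how much a prediction changes. It assumes TransformerLens and GPT-2 small; the prompt, layer and head index are arbitrary illustrative choices, not anything specific from the post.

```python
# Minimal sketch: zero-ablate one attention head and compare logits.
# Assumes TransformerLens; model, prompt, layer and head are illustrative choices.
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")

prompt = "When John and Mary went to the store, John gave a drink to"
tokens = model.to_tokens(prompt)

LAYER, HEAD = 9, 9  # hypothetical head to knock out

def zero_head(z, hook):
    # z is the per-head attention output, shape [batch, pos, head_index, d_head]
    z[:, :, HEAD, :] = 0.0
    return z

clean_logits = model(tokens)
ablated_logits = model.run_with_hooks(
    tokens,
    fwd_hooks=[(utils.get_act_name("z", LAYER), zero_head)],
)

answer = model.to_single_token(" Mary")
print("clean logit for ' Mary':  ", clean_logits[0, -1, answer].item())
print("ablated logit for ' Mary':", ablated_logits[0, -1, answer].item())
```

If backup heads pick up the slack, the ablated logit drops by much less than the head's direct contribution would suggest - exactly the kind of surprise this category of problems is asking about.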
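Similarly, here is a minimal sketch of causal tracing / activation patching: run the model on a corrupted prompt, patch in the clean run's residual stream at one layer and position, and see how much of the clean behaviour returns. Again this assumes TransformerLens; the model, prompts, layer, and the choice to patch the residual stream (rather than, say, individual head outputs) are illustrative assumptions.

```python
# Minimal sketch: activation patching / causal tracing.
# Assumes TransformerLens; model, prompts and layer are illustrative choices.
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")

clean_prompt = "When John and Mary went to the store, John gave a drink to"
corrupt_prompt = "When John and Mary went to the store, Mary gave a drink to"
clean_tokens = model.to_tokens(clean_prompt)
corrupt_tokens = model.to_tokens(corrupt_prompt)

# Cache every activation on the clean run
_, clean_cache = model.run_with_cache(clean_tokens)

LAYER = 8  # hypothetical layer to patch at
# Patch at the position where the two prompts first differ
POS = (clean_tokens != corrupt_tokens).nonzero()[0, 1].item()

def patch_resid(resid_pre, hook):
    # resid_pre has shape [batch, pos, d_model]; overwrite one position with the clean value
    resid_pre[:, POS, :] = clean_cache[hook.name][:, POS, :]
    return resid_pre

corrupt_logits = model(corrupt_tokens)
patched_logits = model.run_with_hooks(
    corrupt_tokens,
    fwd_hooks=[(utils.get_act_name("resid_pre", LAYER), patch_resid)],
)

answer = model.to_single_token(" Mary")
print("corrupted logit for ' Mary':", corrupt_logits[0, -1, answer].item())
print("patched logit for ' Mary':  ", patched_logits[0, -1, answer].item())
```

In practice one sweeps the layer and position over a whole grid and plots how much of the clean behaviour is recovered, which is what localises where the relevant information moves through the model.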