Better late than never: Getting into interpretability in 2025

Before we dive into my thoughts on interpretability, here is how I structured the blog post: It is written as a collection of mostly disjoint commentary and reflections, inspired by seeing Kayo Yin’s recent Substack post. So each section has its own theme but no explicit connection. I hope you enjoy the stream of thoughts!

───

Getting into interpretability a year ago was the best thing that could have happened to me as a scientist and curious human. I do think it might pan out well for my publications and scientific career, but more importantly, beyond that it is just a ton of fun and feels very fundamental.

It forces me to think more deeply, and to learn more actual theory and methodology, than the previous years of my PhD did. In retrospect I could and should have forced myself much earlier to really grapple with fundamental questions around these models. I always loved fundamental philosophical and cognitive-science questions, but I never formally engaged with them deeply on a technical level. Now, almost daily, I am formulating hypotheses, testing them, and trying to come up with explanations and intuitions together with my collaborators.

A year ago I already had a vague sense I wanted to stay in academia after the PhD. Now I’m certain.

───

I wanted to write this blog post because (mechanistic) interpretability is such a unique subfield of AI. One could write a whole book just about the recent history of the field, with its “two cultures” (academics and Anthropic/LessWrong/blog-post folks), flaws, personalities and meta-science questions. I really enjoyed the paper Mechanistic? (Saphra, Wiegreffe). In general, when I pivot into a new field of research I appreciate position papers and philosophical takes much more than methods papers: they give me a sense of where the field currently is, and what directions people think it should take next. Similarly, my friend Marius had this great paper on empirically measuring the impact of interpretability findings: From Insights to Actions: The Impact of Interpretability and Analysis Research on NLP.

When I started getting into language grounding as a field, and got so excited that I wanted to do a PhD on the topic, it wasn’t because of some paper proposing a new model, dataset or training scheme. No, it was again position papers such as Experience Grounds Language. What I really appreciate about interpretability is that there are so many hard philosophical questions, yet they are often highly technical, tied to concrete model internals. Language grounding as a field also has such hard philosophical questions, but they usually remain vague, along the lines of “Where does meaning really come from?” or “What is true language understanding?”. And the specific area I had drifted into over the last years, after moving away from the grounding question itself, often had neither going for it: vision-and-language modeling is in practice not that philosophical, but often also not that technically rigorous.

To come back to the question: why did I want to write these thoughts down? Yes, it is a unique field to get into and there is a lot to think about. I haven’t even gotten started on the whole LessWrong/Anthropic vs. traditional academia clashes. But I think 2025 is also an odd year to get excited about the field: many people, e.g. my roommate/labmate, are disillusioned. SAEs (sparse autoencoders) are suddenly not in anymore. Big promises around model safety or improving performance were not really met. And in terms of proper scientific theories we are also mostly still in the dark. It definitely would have been very different to join the field just a couple of years ago. But maybe this is now the perfect time to get into it. Great science takes time, maybe decades.

───

The community is great. Without exception, every interpretability researcher I have talked to works on cool topics, has thought deeply about models, and has this vibe of simply loving science, discovery, ideas, and brainstorming with friends over beers (or your beverage of choice).

So many names come to mind that have inspired me, given me new perspectives or provided great arguments why interpretability is important: David Bau, Hadas Orgad, Yonatan Belinkov, Mor Geva, Marius Mosbach, Verna Dankers, Philip Isola, Chris Potts, Sonia Joseph, Taylor Webb, Rim Assouel, Tom Vergara Browne, John Hewitt, Sarah Wiegreffe, Naomi Saphra, Adina Williams, many more.

People in this field are kind. They are exceptionally open-minded about new ideas and don’t shut them down when it isn’t immediately obvious how an idea would lead to better models. People are genuinely curious about how things work. When I reflect on what I look for in good collaborators in general, I find these traits most often in interpretability these days: you come out of meetings with them feeling excited and with momentum; interpretability teaches you to critically question a lot of assumptions and details, so they will also do a great job poking holes in your ideas with simple questions; and they often know the nitty-gritty details of how models are built but can also talk big picture.

Putting aside how cool the science side is: sometimes in life it is best to judge whether something suits you by the people who are in it and how it makes you feel at the end of the day, whether it is a friend group, a company, or a PhD program. Go with good vibes, even if there isn’t a perfect topic match. Luckily, with interpretability I get both.

───

Similarly, I have never been more excited about a project than right now. Often when I discuss it with people, they get excited too, or at least have some great questions. I can sense they are engaged and actually thinking about it. That is not the norm in academia. With past projects this hadn’t happened as much; there just wasn’t as much depth to them. This is my first first-author interp project, but my conclusion so far is that interp papers lead to far more potential follow-up ideas than many other types of research. Frankly, there are so many ideas floating around with my collaborators that I sometimes have a hard time noting them all down. I genuinely think I could write 5-10 follow-up works if I had the time (and didn’t get scooped, of course). With some of my past work, I can remember perhaps 1-3 interesting follow-up ideas. But of course, there is a strong confounding factor: as a researcher I have matured and would like to think I can now identify interesting directions much faster than when I started the PhD. At the same time, I would claim that interp can speed up such maturing, since you have to grapple much more frequently with hypothesis formulation and testing. When scanning related work, the average interp paper also teaches you more scientific thinking than the average AI paper. So just by reading around for a couple of months, you eventually absorb a lot of good taste for problems.

───

So far, I have made everything seem rosy. And honestly, it really does feel this way sometimes. The main “downside” these days is working too hard, since I started caring a lot about research again. Beyond these personal aspects, one downside is that with interp work, as in many other sciences, you rarely know when to stop a project. In contrast, when you work directly on methods or datasets there are well-defined targets that make a project “successful”: you beat the baseline, your new benchmark shows models struggle, your collected dataset is large enough and shows models improve. Of course, ideally in interpretability you also have some baselines to beat. But since these baselines usually do not constitute a meaningful downstream application, beating them is not enough. You also have to contribute some neat/cool/novel insight. That can be subjective, and hence you might be unsure whether the cool quirk you found in some layer of an LLM is worth publishing, or just a fun fact to tell friends at lunch.

Another downside is that some people are quite skeptical of interp research and might grill you in interviews and talks. I think that is fine; not everyone has to be excited about the same things. There are fair criticisms, and hiring committees often want to see that you have at least considered how your insights could eventually improve models and, more broadly, society. In that sense, I am grateful that I am surrounded by skeptical mentors and colleagues who push back sometimes. My supervisor has often, and rightfully, questioned how my findings will ultimately impact the downstream behavior of the model, or any other notion of usefulness. Marius Mosbach (a good friend and postdoc in our lab) has been writing papers and organizing workshops around the impact and actionability of interp. Since we talk a lot, he has ingrained that thinking in me from day one. I also like to push back against the pushback: Why does something need to be immediately actionable? Does it help anyone if their idea is shut down because there is no path to SOTA (yet)?

As a result of these discussions, I did not tie myself to any specific “school of thought” or type of interp method, and perhaps not even to the community itself: interp is a tool in the toolbox, a way of thinking. Broadly, we want to understand models and build theories. Whether you call it analysis, inspecting internals, interpretability, science of LLMs, physics of LLMs, etc. does not matter too much.

Chris Potts wrote a great blog post responding to various criticisms of the interpretability field: Assessing skeptical views of interpretability research (also available as a talk on YouTube).

One last thought on impact: framing one’s interp work as a general new tool or lens, as opposed to “merely” insights specific to some model or type of task, can show people that your work might have wider impact. Rightfully, people can become disillusioned if almost all of interp shows some niche behavior in some model. So when your work instead proposes a new tool or framework for analysing models, it might be perceived as more general. A decent analogy here might be the microscope and related technologies, which enabled far wider impact than any specific finding made with an existing tool. Of course, it is easier said than done to invent the next microscope, and let’s see if the tool I propose in my current ongoing work is any good.

───

I also want to take some time to go into more technical detail: great papers and ideas I have come across. What is the field, or the community around me, thinking about in 2025? I am only getting started myself, so this is not meant to be a holistic overview, just my personal highlights.

Actionable Interpretability: When you have an insight, what should others do with it? What actions should they take? As I wrote earlier, people nowadays are no longer content with mere insights. One of my lab friends, Marius Mosbach, has been heading efforts to quantify and define this impact or actionability of interp insights. Recently at COLM, a talk by John Hewitt had a similar message: our goal is to do good post-training/alignment. In his own words, from a slide:
“Research in behavior-internals interplay is just posttraining/alignment research with a big bet that in the long-term, methods that leverage explicit model decomposition will work best.”

I push back in conversations: this makes new researchers in the field (like me) feel pressured to immediately pull some fancy application out of thin air! I do agree that we should be aware of how others could use our work. But sometimes the action others can take after reading my paper is simply to update their mental model of how AI systems work. Or the action can be treating the paper as one more piece in the big puzzle of some phenomenon: if five papers now report such results, maybe there is a larger pattern!

So, let’s be open-minded about what we consider actions in interp. Let’s give ideas time to develop before they have to turn into safety or finetuning methods. And let’s lead with curiosity in many of the conversations we have as a field.

───

The right level of abstraction for interp: This is something I wish I had smarter thoughts on. I know other people have very smart thoughts on this. For example, Dan Hendrycks wrote The Misguided Quest for Mechanistic AI Interpretability, where this paragraph gives a good intuition:
Complex systems cannot easily be reduced to simple mechanisms. As systems become larger and more complex, scientists begin focusing on higher-level properties — such as emergent patterns, collective behaviors, or statistical descriptions — instead of attempting direct analysis at the smallest scale of fundamental interactions. Meteorologists do not try to predict the weather by tracing every single molecule in the atmosphere, for instance. Similarly, it would be intractable to understand biological systems by starting from subatomic particles and working up from there. And few psychologists attempt to explain a person’s behavior by quantifying the contribution of every neuron to their thoughts.

So maybe individual neurons are too low-level for interp theories.

To connect this to AI, Hendrycks argues that AI models can only solve complex tasks if they have complex inner workings that are hard to simplify. Similarly, we humans often can’t precisely explain all of our reasoning and decision making. Another good example from his post:
Discussing explainable AI in a 2018 interview, Geoffrey Hinton put it like this: “If you put in an image, out comes the right decision, say, whether this was a pedestrian or not. But if you ask “Why did it think that?” well if there were any simple rules for deciding whether an image contains a pedestrian or not, it would have been a solved problem ages ago.”

So if we were to simplify models (e.g. compress them into simpler rules), we might get better interp but lose out on a model that can handle edge cases. And edge cases are often the most interesting cases! In some domains they are also quite common.

Here is Dan’s main point summarized:
Interpretability should start from higher-level characteristics. I have been pointing out limitations in interpretability for years. This is not to say we should not try to understand AI models at some level. Rather, recognizing them for the complex systems they are, we should draw inspiration from the study of other such systems. Just as meteorologists, biologists, and psychologists begin by studying higher-level characteristics of their respective subjects, so too should we take a top-down approach to AI interpretability, rather than a bottom-up mechanistic approach.

I am aware I am quoting a lot here, but it is hard to put it better than the article itself!

So Dan’s concrete suggestion is to focus on “representations”, rather than neurons or circuits, as the main level of abstraction, and he points to his group’s 2023 work “Representation Engineering: A Top-Down Approach to AI Transparency”.

My own current project also operates on that level, so that’s not too bad!
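To make “operating at the representation level” a bit more concrete, here is a hedged, minimal sketch in the spirit of (but much simpler than) Representation Engineering: estimate a concept direction as a difference of mean hidden states between contrastive prompt sets, then score new activations against it. The arrays and names (h_concept, h_baseline, h_new) are placeholders I made up; in practice they would be hidden states collected from a model at some layer.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim = 768  # placeholder hidden size

# Placeholder hidden states (one vector per prompt) collected at some layer for two
# contrastive prompt sets, e.g. prompts written in an "honest" vs. "dishonest" style.
h_concept = rng.normal(loc=0.5, scale=1.0, size=(100, hidden_dim))
h_baseline = rng.normal(loc=0.0, scale=1.0, size=(100, hidden_dim))

# Difference-of-means "concept direction". (The RepE paper itself uses PCA over paired
# differences; the mean difference is just the simplest version of the same idea.)
direction = h_concept.mean(axis=0) - h_baseline.mean(axis=0)
direction /= np.linalg.norm(direction)

# Score a new activation by how strongly it expresses the concept: its projection
# onto the direction. This is the kind of representation-level readout I mean.
h_new = rng.normal(loc=0.5, scale=1.0, size=hidden_dim)
print(f"concept score: {float(h_new @ direction):.2f}")
```

The same kind of direction can also be added to or subtracted from activations to steer behavior, which is where the “engineering” part of the name comes from.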

Switching gears, a bit of history: the most well-known framework for levels of abstraction seems to be Marr’s 3 levels of analysis. This framework has been suggested to me several times, so here is a TLDR:

You can analyze an information-processing system at the level of computation, algorithm, or implementation. Implementation is the easiest to explain: the exact hardware or low-level substrate, e.g. the individual neurons. The computational and algorithmic levels seem a bit more related, to me at least. The computational level cares about input-output behavior and less about the in-between: what kind of function is it broadly (i.e. its signature), and is it e.g. a bijective mapping? The algorithmic level sits in between, asking which representations and procedures carry out that computation, e.g. analysing how multi-head attention implements a particular operation.
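To make the distinction more concrete, here is a deliberately trivial toy illustration of my own (not from Marr): the same little sorting system described at all three levels, with a rough LLM analogue noted in the comments as an analogy, not a definition.

```python
# Marr's three levels, illustrated on a toy system (sorting).

def sort_numbers(xs: list[float]) -> list[float]:
    """COMPUTATIONAL level: the input-output specification.
    "Given a list, return the same elements in non-decreasing order."
    (LLM analogue: "map a context to a distribution over plausible next tokens.")
    """
    # ALGORITHMIC level: which procedure and representations realize that spec.
    # Here: insertion sort over a growing Python list.
    # (LLM analogue: which attention patterns / circuits implement e.g. induction.)
    result: list[float] = []
    for x in xs:
        i = 0
        while i < len(result) and result[i] <= x:
            i += 1
        result.insert(i, x)
    return result

# IMPLEMENTATION level: the physical substrate actually executing the algorithm,
# e.g. CPython bytecode, CPU instructions, transistors.
# (LLM analogue: individual neurons/weights and the GPU kernels running them.)

assert sort_numbers([3.0, 1.0, 2.0]) == [1.0, 2.0, 3.0]
```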

If I remember correctly, during a broad chat we had about the interpretability field, Philip Isola argued for focusing more on the highest level (the computational one) to understand AI systems better.

Where am I going with this? I don’t have a great answer to what the right level of abstraction is; nobody has. Like many others, I feel that neurons are not it. I have found embedding-space analyses quite fruitful for insights in my own recent work. I know too little about circuit discovery to make a strong statement, but the current concrete techniques, which discover circuits using minimally different prompts of the same length, feel too narrow. I doubt we can find clean circuits for many interesting complex skills.

───

Multimodal Interpretability: Interpreting multimodal models, such as LLMs that process image or video input, has been my recent focus. I want to make a case here for why I think it’s a great test-bed for interpretability, and outline the many fascinating interp questions that only arise when two (or more) modalities mix.

Of course I am somewhat biased, having worked on vision+language for five years now, but I genuinely think language is only a part of the bigger picture of intelligence, and thus of interpretability. Language is neat and clean, almost by definition interpretable, and allows for studying many high-level cognitive phenomena such as theory-of-mind, knowledge or reasoning. So it only makes sense that a lot of fruitful interpretability has happened there. But there is so much more out there!

Sonia Joseph laid out a good argument for why multimodal interpretability should not just be an afterthought to LLM interpretability, hidden deep in a LessWrong comment section (a common situation in the interp field: science happening in forums). It is quite long, but I want to include it in full here. If this comment piques your interest, it is part of a larger philosophical debate in that comment section around interp on different modalities (featuring comments from Nostalgebraist, author of the famous Logitlens). So here is Sonia’s comment:

There is sometimes an implicit zeitgeist in the mech interp community that other modalities will simply be an extension or subcase of language.

I want to flip the frame, and consider the case where other modalities may actually be a more general case for mech interp than language. As a loose analogy, the relationship between language mech interp and multimodal mech interp may be like the relationship between algebra and abstract algebra. I have two points here.

Alien modalities and alien world models

The reason that I’m personally so excited by non-language mech interp is due to the philosophy of language (Chomsky/Wittgenstein). I’ve been having similar intuitions to your second point. Language is an abstraction layer on top of perception. It is largely optimized by culture, social norms, and language games. Modern English is not the only way to discretize reality, but the way our current culture happens to discretize reality.

To present my point in a more sci-fi way, non-language mech interp may be more general because now we must develop machinery to deal with alien modalities.  And I suspect many of these AI models will have very alien world models! Looking at the animal world, animals communicate with all sorts of modalities like bees seeing with ultraviolet light, turtles navigating with magnet fields, birds predicting weather changes with barometric pressure sensing, aquatic animals sensing dissolved gases in the water, etc. Various AGIs may have sensors to take in all sorts of “alien” data that the human language may not be equipped for. I am imagining a scenario in which a superintelligence discretizes the world in seemingly arbitrary ways, or maybe following a hidden logic based on its objective function.

Language is already optimized by humans to modularize reality into this nice clean way. Perception already filtered through language is by definition human interpretable so the deck is already largely stacked in our favor. You allude to this with your point photographers, dancers, etc developing their own language to describe subtle patterns in perception that the average human does not have language for.  Wine connoisseurs develop vocabulary to discretize complex wine-tasting percepts into words like “bouquet” and “mouth-feel.” Make-up artists coin new vocabulary for around contouring, highlighting, cutting the crease, etc to describe subtle artistry that may be imperceptible to the average human.

I can imagine a hypothetical sci-fi scenario where the only jobs available are apprenticing yourself to a foundation model at a young age for life, deeply understanding its world model, and communicating its unique and alien world model to the very human realm of your local community (maybe through developing jargon or dialect, or even through some kind of art, like poetry, or dance, communication forms humans currently use to bypass the limitations of language).

Self-supervised vision models like DINO are free of a lot of human biases but may not have as interpretable of a world model as CLIP, which is co-optimized with language. I believe DINO’s lack of language bias to be either a safety issue or a superpower, depending on the context (safety in that we may not understand this “alien” world model, but superpower in that DINO may be freer from human biases that may be, in many contexts, unwanted!).

It seems that I am not the only one getting excited about multimodal interpretability. One positive indication was the large number of popular multimodal interpretability papers at COLM 2025: Hidden in plain sight: VLMs overlook their visual representations received an Outstanding Paper Award, while Interpreting the linear structure of vision-language model embedding spaces received a Spotlight.

Here are some concrete research questions (RQs) to show that multimodal interp opens up entirely new questions:

  1. On a meta-science level: Do tools and insights (e.g. Logitlens, circuits, …) developed primarily with text-only data also hold in perceptual and embodied settings? (See the sketch after this list for what a minimal Logitlens check looks like.)
  2. Which concepts are encoded cross-modally somewhere inside the model, i.e. neither in text nor vision form but in some modality-agnostic form? Where exactly in the model does this happen, and what is the process for “merging” visual and textual concepts?
  3. For models that learn vision and language from scratch together, such as Chameleon, we hope to see synergies between modalities. Is that actually the case? At what points during training do linguistic or perceptual skills emerge? Do they share circuits?
  4. While CoT reasoning has made great progress, visual reasoning is still in its infancy. How can we interpret reasoning that no longer happens in textual form, such as continuous reasoning or reasoning in pixel space?
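For RQ 1, here is what a minimal Logitlens-style check looks like: project each layer’s hidden state through the unembedding matrix and see which token that layer “currently predicts”. This is a hedged sketch on text-only GPT-2 via Hugging Face transformers purely for brevity; for a VLM one would run the same projection at the positions of image tokens, and whether the readouts are meaningful there is exactly the open question.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

inputs = tokenizer("The Eiffel Tower is located in the city of", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

unembed = model.get_output_embeddings().weight  # (vocab_size, hidden_size)
final_ln = model.transformer.ln_f               # GPT-2's final layer norm

# out.hidden_states has one entry per layer (plus the embedding layer at index 0).
for layer, h in enumerate(out.hidden_states):
    # Apply the final layer norm before unembedding, as in the usual logit-lens recipe,
    # and read out the top token for the last position at this layer.
    logits = final_ln(h[0, -1]) @ unembed.T
    top_token = tokenizer.decode(logits.argmax().item())
    print(f"layer {layer:2d} -> {top_token!r}")
```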

Trends:

I have not been in the field long enough to make strong statements about trends. But I do find the field is maturing a little after so much criticism over the years. People are often self-aware about actionability, weak baselines, correlation vs. causation, etc.

Some trends that are coming or going: SAEs were hot until this summer, when e.g. Neel Nanda wrote that his team was deprioritizing them. Circuit discovery might, as mentioned above, have too narrow a toolset, and a more senior researcher recently told me that circuits apparently are not that popular anymore. Maybe cleverly designed probes will make a comeback? (A minimal probe sketch follows below.) The conclusion for people who are in this for the long haul should be that trends come and go. Let’s not stick too much to a single trendy method; stay open-ended and consider simple baselines. I think it’s good to know the “history” of the field: how probes and attention maps started and were perceived over the years. I certainly have a lot more reading to do myself.
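Since I just mentioned probes, here is the minimal version of what a linear probe is, as a hedged toy sketch (synthetic data standing in for hidden states; not any specific paper’s setup): train a simple classifier on frozen activations and check whether a property is linearly decodable.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-ins for hidden states extracted from one layer of a model (n_examples, hidden_dim)
# and labels for the property being probed (e.g. "is the subject of the sentence plural?").
hidden_states = rng.normal(size=(2000, 768))
labels = (hidden_states[:, :10].sum(axis=1) > 0).astype(int)  # toy, linearly decodable signal

X_train, X_test, y_train, y_test = train_test_split(hidden_states, labels, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"probe accuracy: {probe.score(X_test, y_test):.2f}")

# Caveat from the probing literature: always compare against a control task or
# shuffled-label baseline, otherwise high accuracy may just reflect probe capacity.
```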

───

Final thoughts: Working on interpretability instills a mindset of always asking why and how, and of critically questioning assumptions deeply held by the community. In essence: the scientific method and curiosity. It’s not a perfect field, but I think it fosters the right mindset for understanding AI systems in the coming decades. The journey has just begun.

Final final addendum: Among the many inspiring interpretability researchers, David Bau really captures the spirit I am trying to convey in my writing here. He recently gave a talk at the MechInterp workshop at NeurIPS 2025, summarized in the blog post In Defense of Curiosity.