Gemma4-12B was released by Google a few weeks ago, and it piqued my interest as an interp researcher because it diverges from the mainstream paradigm: it is an encoder-free vision-language model that processes image patches and audio segments directly as tokens, without a separate vision or audio encoder. This raises an interpretability question: what do those non-text tokens represent at each layer of the transformer, and do they resemble interpretable text representations?
We apply two complementary probes to every visual/audio token at 12 selected layers (1, 4, 8, 12, 16, 24, 28, 32, 36, 40, 44, 47). LatentLens (Krojer et al., 2026; normalized) retrieves the nearest-neighbor tokens from a large caption corpus in the model's own representation space. I had developed this method recently (presenting at ICML in Seoul) and was curious how well it would work here. Interestingly, compared to our original LatentLens, I had to add a normalization step to account for outlier dimensions that otherwise dominate cosine similarity.[1] I also apply the more established LogitLens (nostalgebraist, 2020), which projects the raw hidden state through the unembedding matrix to return ranked vocabulary predictions. While LatentLens retrieves corpus tokens by similarity search, LogitLens reads off vocabulary logits directly; comparing both across layers reveals how the model processes non-text modalities.
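To make the difference between the two probes concrete, here is a minimal NumPy sketch on toy data. It is not the actual LatentLens or LogitLens code: the function names, array shapes, and the tiny identity "unembedding" are assumptions chosen for illustration, and the LatentLens normalization step is left out.

```python
import numpy as np

def logit_lens(hidden, unembed, k=5):
    """Rank vocabulary ids by logit for one hidden state.

    hidden:  (d,) raw hidden state of a patch or audio token
    unembed: (vocab, d) unembedding matrix (toy stand-in here)
    """
    logits = unembed @ hidden
    return np.argsort(-logits)[:k]

def latent_lens(hidden, corpus_states, k=5):
    """Return indices of the k corpus token states closest in cosine similarity.

    corpus_states: (n, d) hidden states of text tokens from a caption corpus,
    assumed to be taken from the same layer as `hidden`.
    """
    q = hidden / np.linalg.norm(hidden)
    c = corpus_states / np.linalg.norm(corpus_states, axis=1, keepdims=True)
    return np.argsort(-(c @ q))[:k]

# Toy example: 6-word vocabulary with an identity unembedding (assumption).
unembed = np.eye(6)
hidden = np.array([0.1, 0.9, 0.3, 0.0, 0.0, 0.5])
top_vocab = logit_lens(hidden, unembed, k=3)

# Toy corpus of four 2-d states; the query points along corpus row 2.
corpus_states = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
nearest = latent_lens(np.array([2.0, 2.0]), corpus_states, k=1)

print(top_vocab, nearest)
```

The LogitLens path only needs the unembedding matrix, while the LatentLens path needs a bank of precomputed corpus states, which is why the two can disagree at a given layer.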
Explore the demos or read through some findings below.
We manually inspect a handful of PixMo-Cap examples, examining nearest-neighbor predictions for individual patches at 12 selected layers (1, 4, 8, 12, 16, 24, 28, 32, 36, 40, 44, 47).
| Layer range | LatentLens (norm.) | LogitLens | Key observation |
|---|---|---|---|
| 1–4 | nothing | not working | High raw cosine similarity at L4 may reflect a rogue-dimension artifact |
| 8 | OCR only | not working | First interpretable signal, exclusively on image text |
| 12 | OCR + first semantics | not working | Semantic (non-OCR) content begins appearing |
| 16 | ~50% patches interpretable | not working | Raw cosine similarities spike to >0.99 for LatentLens (unnorm.) — all corpus tokens look identically similar, so nearest neighbors are meaningless. LatentLens (norm.) is unaffected and peaks here. |
| 24–28 | mostly silent | not working | LatentLens loses most signal; only a few OCR patches remain |
| 32 | OCR only | not working | OCR retrieval recovers in LatentLens; non-OCR still silent |
| 36 | partial, lower diversity | starts working | Partial recovery in LatentLens, but less diverse — same few neighbors recur across patches of the same image |
| 40–44 | OCR competitive | leads on semantics; next-token on OCR | LogitLens overtakes LatentLens on general content; on OCR patches LogitLens predicts the next word in image text, not the current one |
| 47 | silent | slightly weaker than L44 | Both methods degrade; last layer optimized for next-token prediction, not patch decoding |
LatentLens (normalized) and LogitLens are complementary: LatentLens provides the main interpretable signal at layers 8–16, while LogitLens takes over at layers 40–44. Layers 24–28 are largely non-interpretable with LatentLens (norm.) — only a few OCR patches remain; LogitLens does not work here. OCR retrieval in LatentLens partially recovers at L32. At layer 16, raw cosine similarities spike to >0.99 for the unnormalized case — every corpus token looks nearly identical to every query token, so nearest-neighbor retrieval is uninformative. Z-score normalization corrects this.
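The effect of a single outlier dimension on cosine similarity can be reproduced on synthetic data. The sketch below is an illustration under stated assumptions, not the normalization used in the actual experiments: the per-dimension mean and standard deviation are estimated from the corpus, and the offset of 500 on dimension 0 is a made-up stand-in for a rogue dimension.

```python
import numpy as np

def cosine_matrix(queries, corpus):
    """Pairwise cosine similarity, shape (n_queries, n_corpus)."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    return q @ c.T

def zscore(x, mean, std, eps=1e-6):
    """Standardize each dimension with the given statistics."""
    return (x - mean) / (std + eps)

rng = np.random.default_rng(0)
d, n_corpus, n_query = 64, 200, 10

corpus = rng.normal(size=(n_corpus, d))
# Queries are noisy copies of the first 10 corpus rows, so row i is the true neighbor.
queries = corpus[:n_query] + 0.1 * rng.normal(size=(n_query, d))

# Hypothetical rogue dimension: one coordinate carries a huge shared offset.
corpus[:, 0] += 500.0
queries[:, 0] += 500.0

raw = cosine_matrix(queries, corpus)

# Assumption: statistics are computed per dimension over the corpus states.
mean, std = corpus.mean(axis=0), corpus.std(axis=0)
norm = cosine_matrix(zscore(queries, mean, std), zscore(corpus, mean, std))

print("raw  min/max:", raw.min(), raw.max())
print("norm min/max:", norm.min(), norm.max())
```

In this toy setup every raw similarity lands above 0.99, so the scores carry almost no contrast, whereas the z-scored similarities spread out again and each query's top match is its true neighbor.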
We probe 50 speech clips from LibriSpeech test-clean. Each 40 ms audio token is evaluated independently at 12 selected layers. We measure the fraction of clips in which at least one top-5 prediction, from any token, matches a word in the transcript.
LatentLens (normalized) peaks at 50% of clips at layer 16 — the same layer where LogitLens produces no correct predictions at all (0%). LogitLens recovers and overtakes LatentLens from layer 32 onward, peaking at 68% at layer 36.
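The clip-level metric described above can be sketched as follows. The matching rule here (case-insensitive exact word match after stripping punctuation) is an assumption, as are the helper names; the toy clips are invented purely to exercise the function.

```python
import re

def words(text):
    """Lowercased set of word-like strings in a piece of text."""
    return set(re.findall(r"[a-z']+", text.lower()))

def clip_is_hit(token_predictions, transcript):
    """True if any top-5 prediction of any audio token matches a transcript word.

    token_predictions: one list of predicted strings per audio token.
    """
    target = words(transcript)
    return any(words(p) & target for preds in token_predictions for p in preds)

def clip_hit_rate(clips):
    """Fraction of (token_predictions, transcript) pairs that count as a hit."""
    hits = [clip_is_hit(preds, transcript) for preds, transcript in clips]
    return sum(hits) / len(hits)

# Toy data (invented): the first clip has a matching prediction, the second does not.
toy_clips = [
    ([["the", " Concord"], ["poetry"]], "Concord returned"),
    ([["karate"], ["surroundings"]], "he spoke softly"),
]
rate = clip_hit_rate(toy_clips)
print(rate)
```

Because a single matching token is enough for a clip to count, this metric is lenient; it says nothing about how many tokens in the clip are interpretable.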
Environmental sounds (ESC-50). We separately probed 50 environmental sound clips from ESC-50 (one per category: crow, rain, chainsaw, …). This is a much more open-ended scenario than language-centric speech that comes with a transcript. We used an LLM judge to evaluate all 700 clip–layer pairs for any semantic connection to the sound category, yielding only a single hit (!) out of 700: “aerodynamics” for the airplane category at layer 40. All other predictions are unrelated vocabulary (“expressive”, “poetry”, “karate”, “surroundings”) with no connection to the sound being played. Neither LatentLens nor LogitLens revealed interpretable non-speech audio representations at any of the 14 probed layers.
When inspecting the demo and listening to spoken words, I was surprised that LatentLens predictions for a token often matched a word spoken earlier in the clip, e.g. a token at t=3s would retrieve “concord” even though “concord” was spoken at t=0.5s. The model appears to stay anchored to earlier content words as context accumulates. Yet the order of retrieved words across tokens still matched the transcript order.
Absolute timing (left) is low: at the peak layer (L16), only 6.4% of tokens predict the word expected at that exact timestamp. Temporal ordering (right) is much higher: when predictions do match transcript words, they tend to appear in the correct sequence (Spearman ρ ≈ 0.78 at L16, 0.67–0.75 at L36–44).
In other words: at layer 16, the model has encoded much of the utterance's vocabulary in the correct sequence, but without precise temporal localization. The ordering finding is robust (n=47/50 clips) but not perfect — ρ of 0.78 means substantial ordering with real scatter.
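The two measurements can be told apart with a small sketch. How the original numbers were computed (alignment source, top-1 versus top-5) is not stated, so everything here is an assumption: each token has one predicted word, each token also has the word aligned to its timestamp, and ordering is measured as Spearman correlation between token time and the transcript position of the predicted word. The toy sequence is invented.

```python
import numpy as np

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="stable")
    ranks = np.empty(len(values))
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks

def spearman_rho(x, y):
    """Spearman correlation as the Pearson correlation of average ranks."""
    return float(np.corrcoef(average_ranks(x), average_ranks(y))[0, 1])

def absolute_timing_accuracy(predicted_words, aligned_words):
    """Fraction of tokens whose prediction equals the word at that timestamp."""
    matches = [p == a for p, a in zip(predicted_words, aligned_words)]
    return sum(matches) / len(matches)

# Toy clip (invented): predictions lag behind the audio but keep transcript order.
token_times = [0.5, 1.0, 1.5, 2.0]
aligned = ["concord", "returned", "to", "its"]
predicted = ["concord", "concord", "returned", "returned"]
predicted_positions = [aligned.index(w) for w in predicted]  # position in transcript

timing = absolute_timing_accuracy(predicted, aligned)
ordering = spearman_rho(token_times, predicted_positions)
print(timing, ordering)
```

In the toy clip only one of four tokens names the word at its own timestamp, yet the rank correlation stays high, which is the same qualitative pattern as low absolute timing combined with strong temporal ordering.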
Three findings appear consistently across both modalities and both probe methods:
If you find this useful, you can cite it as: