This demo accompanies the paper "LatentLens: Revealing Highly Interpretable Visual Tokens in LLMs". We train connector-only VLMs (a frozen LLM paired with a frozen vision encoder, with only the connector trained) and analyze the interpretability of visual tokens at each layer using three methods (abbreviated LN, LL, and NN in the result cells below).
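For context, the sketch below illustrates the connector-only pattern described above: the vision encoder and LLM are frozen, only a small connector is trained, and per-layer hidden states of the visual tokens are collected for analysis. This is a minimal, hypothetical illustration with tiny stand-in modules (all class and variable names are made up here), not the paper's implementation.

```python
# Minimal sketch (not the paper's code): a connector-only VLM where the vision
# encoder and LLM are frozen and only a small MLP connector is trained.
# The real models (e.g. ViT-L/14-336 + LLaMA3-8B) are replaced with tiny
# stand-in modules so the sketch runs as-is; all names are illustrative.
import torch
import torch.nn as nn

class TinyVisionEncoder(nn.Module):            # stand-in for a ViT/DINOv2/SigLIP encoder
    def __init__(self, n_patches=16, dim=32):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
    def forward(self, patches):
        # assume `patches` is already patchified to (batch, n_patches, dim)
        return self.proj(patches)

class TinyLLM(nn.Module):                      # stand-in for a frozen LLM
    def __init__(self, dim=64, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_layers)])
    def forward(self, embeds, output_hidden_states=False):
        hidden_states = [embeds]
        h = embeds
        for layer in self.layers:
            h = torch.relu(layer(h))
            hidden_states.append(h)
        return (h, hidden_states) if output_hidden_states else h

vision = TinyVisionEncoder().eval().requires_grad_(False)   # frozen
llm = TinyLLM().eval().requires_grad_(False)                 # frozen
connector = nn.Sequential(nn.Linear(32, 64), nn.GELU(), nn.Linear(64, 64))  # trained

patches = torch.randn(2, 16, 32)               # dummy batch of patchified images
visual_tokens = connector(vision(patches))     # (batch, n_patches, llm_dim)

# Per-layer hidden states of the visual tokens: the objects whose
# interpretability is analyzed layer by layer in this demo.
_, per_layer = llm(visual_tokens, output_hidden_states=True)
print([h.shape for h in per_layer])            # one entry per LLM layer (plus the input)
```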
Click on any model cell below to explore per-patch interpretability across layers.
| LLM \ Vision Encoder | ViT-L/14-336 | DINOv2-L-336 | SigLIP |
|---|---|---|---|
| LLaMA3-8B | View Results<br>LN: 9 \| LL: 15 \| NN: 15 | View Results<br>LN: 9 \| LL: 15 \| NN: 15 | View Results<br>LN: 9 \| LL: 15 \| NN: 15 |
| OLMo-7B | View Results<br>LN: 9 \| LL: 15 \| NN: 15 | View Results<br>LN: 9 \| LL: 15 \| NN: 15 | View Results<br>LN: 9 \| LL: 15 \| NN: 15 |
| Qwen2-7B | View Results<br>LN: 9 \| LL: 14 \| NN: 11 | View Results<br>LN: 9 \| LL: 14 \| NN: 11 | View Results<br>LN: 9 \| LL: 14 \| NN: 11 |
Below, we explore the impact of training variations, random seeds, and off-the-shelf models.
| Model / Variation | Details | Results |
|---|---|---|
| Qwen2-VL (off-the-shelf) | 10 images | View Results |
| Seed 10 | 10 images | View Results |
| Seed 11 | 10 images | View Results |
| Linear Connector | 10 images | View Results |
| Unfreeze LLM | 10 images | View Results |
| First-Sentence Captions | 10 images | View Results |
| Earlier ViT Layer (6) | 10 images | View Results |
| Earlier ViT Layer (10) | 10 images | View Results |
| TopBottom Task | 10 images | View Results |
| TopBottom + Unfreeze | 10 images | View Results |
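To make the variation axes in the table above concrete, here is a hypothetical configuration sketch; the field names and default values are illustrative and are not the paper's actual configuration format.

```python
# Hypothetical config sketch of the variation runs listed above.
# Field names and defaults are illustrative, not the paper's actual configs.
from dataclasses import dataclass, replace

@dataclass
class RunConfig:
    connector: str = "mlp"          # base runs use an MLP connector
    freeze_llm: bool = True         # connector-only: the LLM stays frozen
    caption_style: str = "full"     # full captions vs. first sentence only
    vit_feature_layer: int = -2     # which ViT layer feeds the connector (assumed default)
    task: str = "captioning"
    seed: int = 0

base = RunConfig()
variations = {
    "Seed 10":                 replace(base, seed=10),
    "Seed 11":                 replace(base, seed=11),
    "Linear Connector":        replace(base, connector="linear"),
    "Unfreeze LLM":            replace(base, freeze_llm=False),
    "First-Sentence Captions": replace(base, caption_style="first_sentence"),
    "Earlier ViT Layer (6)":   replace(base, vit_feature_layer=6),
    "Earlier ViT Layer (10)":  replace(base, vit_feature_layer=10),
    "TopBottom Task":          replace(base, task="topbottom"),
    "TopBottom + Unfreeze":    replace(base, task="topbottom", freeze_llm=False),
}
```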