This demo accompanies the paper "LatentLens: Revealing Highly Interpretable Visual Tokens in LLMs". We train connector-only VLMs (frozen LLM + frozen vision encoder) and analyze the interpretability of visual tokens at each layer using three methods (abbreviated LN, LL, and NN in the tables below).
Click on any model cell below to explore per-patch interpretability across layers.
| LLM \ Vision Encoder | ViT-L/14-336 | DINOv2-L-336 | SigLIP |
|---|---|---|---|
| LLaMA3-8B | View Results<br>LN: 9, LL: 15, NN: 15 | View Results<br>LN: 9, LL: 15, NN: 15 | View Results<br>LN: 9, LL: 15, NN: 15 |
| OLMo-7B | View Results<br>LN: 9, LL: 15, NN: 15 | View Results<br>LN: 9, LL: 15, NN: 15 | View Results<br>LN: 9, LL: 15, NN: 15 |
| Qwen2-7B | View Results<br>LN: 9, LL: 14, NN: 11 | View Results<br>LN: 9, LL: 14, NN: 11 | View Results<br>LN: 9, LL: 14, NN: 11 |
Pre-trained VLMs analyzed without connector-only training; these models were trained end-to-end by their respective teams.
| Model | Details | Results |
|---|---|---|
| Qwen2-VL-7B | Qwen2 backbone, dynamic resolution ViT. 10 images. | View Results |
| Qwen2.5-VL-32B | Qwen2.5-32B backbone, 64 layers. 10 images. | View Results |
| Molmo-7B-D | Qwen2 backbone, multi-crop ViT. 10 images. | View Results |
| LLaVA-1.5-7B | Vicuna backbone, CLIP ViT-L/14-336. 10 images. | View Results |
Ablations exploring the impact of training variations and random seeds.
| Model / Variation | Details | Results |
|---|---|---|
| Seed 10 | 10 images | View Results |
| Seed 11 | 10 images | View Results |
| Linear Connector | 10 images | View Results |
| Unfreeze LLM | 10 images | View Results |
| First-Sentence Captions | 10 images | View Results |
| Earlier ViT Layer (6) | 10 images | View Results |
| Earlier ViT Layer (10) | 10 images | View Results |
| TopBottom Task | 10 images | View Results |
| TopBottom + Unfreeze | 10 images | View Results |