LatentLens Interactive Demo

Interpreting visual tokens in LLMs with contextual embeddings

About This Demo

This demo accompanies the paper "LatentLens: Revealing Highly Interpretable Visual Tokens in LLMs". We train connector-only VLMs (frozen LLM + frozen vision encoder, with only the connector trained) and analyze the interpretability of visual tokens at each layer using three methods, reported as LN, LL, and NN in the grid below.
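The connector-only setup can be sketched as follows. This is a minimal illustration, not the paper's implementation: `TinyEncoder`/`TinyLLM` stand in for a real vision encoder and LLM, and the two-layer MLP connector is an assumption about the connector's shape.

```python
import torch
import torch.nn as nn


class ConnectorOnlyVLM(nn.Module):
    """Frozen vision encoder + frozen LLM; only the connector trains."""

    def __init__(self, vision_encoder: nn.Module, llm: nn.Module,
                 vision_dim: int, llm_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.llm = llm
        # Two-layer MLP connector: the only trainable component.
        self.connector = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        # Freeze everything except the connector.
        for module in (self.vision_encoder, self.llm):
            for p in module.parameters():
                p.requires_grad = False

    def forward(self, pixels: torch.Tensor) -> torch.Tensor:
        patches = self.vision_encoder(pixels)    # (B, N, vision_dim)
        visual_tokens = self.connector(patches)  # (B, N, llm_dim)
        return self.llm(visual_tokens)


# Stand-in modules just to show the wiring (not real models).
vlm = ConnectorOnlyVLM(
    vision_encoder=nn.Linear(3, 8),  # hypothetical patch featurizer
    llm=nn.Linear(16, 16),           # hypothetical LLM body
    vision_dim=8,
    llm_dim=16,
)
trainable = [n for n, p in vlm.named_parameters() if p.requires_grad]
```

With this wiring, only `connector.*` parameters receive gradients, so the visual tokens analyzed at each LLM layer come from a model whose language and vision weights are untouched.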

Click on any model cell below to explore per-patch interpretability across layers.

| LLM \ Vision Encoder | ViT-L/14-336 | DINOv2-L-336 | SigLIP |
|---|---|---|---|
| LLaMA3-8B | View Results (LN: 9, LL: 15, NN: 15) | View Results (LN: 9, LL: 15, NN: 15) | View Results (LN: 9, LL: 15, NN: 15) |
| OLMo-7B | View Results (LN: 9, LL: 15, NN: 15) | View Results (LN: 9, LL: 15, NN: 15) | View Results (LN: 9, LL: 15, NN: 15) |
| Qwen2-7B | View Results (LN: 9, LL: 14, NN: 11) | View Results (LN: 9, LL: 14, NN: 11) | View Results (LN: 9, LL: 14, NN: 11) |

Off-the-Shelf VLMs

Pre-trained VLMs analyzed off the shelf, without our connector-only training. These models were trained end-to-end by their respective teams.

| Model | Details | Results |
|---|---|---|
| Qwen2-VL-7B | Qwen2 backbone, dynamic-resolution ViT; 10 images | View Results |
| Qwen2.5-VL-32B | Qwen2.5-32B backbone, 64 layers; 10 images | View Results |
| Molmo-7B-D | Qwen2 backbone, multi-crop ViT; 10 images | View Results |
| LLaVA-1.5-7B | Vicuna backbone, CLIP ViT-L/14-336; 10 images | View Results |

Ablation Studies

Exploring the impact of training variations and random seeds.

| Model / Variation | Details | Results |
|---|---|---|
| Seed 10 | 10 images | View Results |
| Seed 11 | 10 images | View Results |
| Linear Connector | 10 images | View Results |
| Unfreeze LLM | 10 images | View Results |
| First-Sentence Captions | 10 images | View Results |
| Earlier ViT Layer (6) | 10 images | View Results |
| Earlier ViT Layer (10) | 10 images | View Results |
| TopBottom Task | 10 images | View Results |
| TopBottom + Unfreeze | 10 images | View Results |