LatentLens Interactive Demo

Interpreting visual tokens in LLMs with contextual embeddings

About This Demo

This demo accompanies the paper "LatentLens: Revealing Highly Interpretable Visual Tokens in LLMs". We train connector-only VLMs (frozen LLM + frozen vision encoder) and analyze the interpretability of visual tokens at each layer using three methods, abbreviated LN, LL, and NN in the results below.
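
The per-layer analysis can be pictured with a small logit-lens-style probe: take a visual token's hidden state at a given layer and find the vocabulary entry whose unembedding row matches it best. This is only an illustrative sketch under that assumption; the function names, toy weights, and layer values below are hypothetical and not taken from the paper.

```python
# Hypothetical sketch: probing a visual token's hidden state at each layer
# by scoring it against the LLM's unembedding rows (logit-lens style).
# All names and toy values are illustrative, not from the paper.

def nearest_vocab_token(hidden_state, unembedding, vocab):
    """Return the vocab entry whose unembedding row has the highest
    dot product with hidden_state, along with that score."""
    scores = [sum(h * w for h, w in zip(hidden_state, row)) for row in unembedding]
    best = max(range(len(vocab)), key=lambda i: scores[i])
    return vocab[best], scores[best]

# Toy setup: 3-dim hidden states, 3-word vocabulary.
vocab = ["cat", "dog", "tree"]
unembedding = [
    [1.0, 0.0, 0.0],  # "cat"
    [0.0, 1.0, 0.0],  # "dog"
    [0.0, 0.0, 1.0],  # "tree"
]

# Toy per-layer hidden states of one visual patch token.
per_layer_hidden = {
    0: [0.2, 0.1, 0.0],
    9: [0.9, 0.3, 0.1],
    15: [0.8, 0.2, 0.0],
}

for layer, h in sorted(per_layer_hidden.items()):
    token, score = nearest_vocab_token(h, unembedding, vocab)
    print(f"layer {layer:2d}: nearest token = {token!r} (score {score:.2f})")
```

A real probe would use the model's actual unembedding matrix and hidden states; the structure of the loop over layers is the point here.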

Click on any model cell below to explore per-patch interpretability across layers.

LLM \ Vision Encoder | ViT-L/14-336       | DINOv2-L-336       | SigLIP
LLaMA3-8B            | LN 9, LL 15, NN 15 | LN 9, LL 15, NN 15 | LN 9, LL 15, NN 15
OLMo-7B              | LN 9, LL 15, NN 15 | LN 9, LL 15, NN 15 | LN 9, LL 15, NN 15
Qwen2-7B             | LN 9, LL 14, NN 11 | LN 9, LL 14, NN 11 | LN 9, LL 14, NN 11

Ablation Studies & Additional Models

Exploring the impact of training variations, random seeds, and off-the-shelf models.

Model / Variation        | Details   | Results
Qwen2-VL (off-the-shelf) | 10 images | View Results
Seed 10                  | 10 images | View Results
Seed 11                  | 10 images | View Results
Linear Connector         | 10 images | View Results
Unfreeze LLM             | 10 images | View Results
First-Sentence Captions  | 10 images | View Results
Earlier ViT Layer (6)    | 10 images | View Results
Earlier ViT Layer (10)   | 10 images | View Results
TopBottom Task           | 10 images | View Results
TopBottom + Unfreeze     | 10 images | View Results