CLIP text-image interpretability
Visual Interpretability / XAI Tool for CLIP ViT (Vision Transformer) models
Credits & Prerequisites
Overview
In simple terms: the tool feeds an image to a CLIP ViT and uses gradient ascent to obtain a "CLIP opinion", i.e. words (text tokens) describing the image, then uses each [token] + [image] pair to visualize what CLIP is "looking at" (attention visualization), producing an overlay "heatmap" image.
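The gradient-ascent half of that pipeline can be sketched roughly as below. This is not the code from clipga.py; it is a minimal illustration assuming OpenAI's clip package and PyTorch, and the image path ("example.jpg"), soft-prompt length, learning rate and step count are arbitrary placeholders.

```python
# Minimal sketch of CLIP "gradient ascent": optimize a soft prompt (continuous
# token embeddings) so its text embedding matches an image embedding, then read
# the result back as the nearest vocabulary tokens ("CLIP opinion").
# NOT the repo's clipga.py -- an illustration only; hyperparameters are arbitrary.
import torch
import clip
from PIL import Image
from clip.simple_tokenizer import SimpleTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
model = model.float().eval()
for p in model.parameters():
    p.requires_grad_(False)

# Fixed target: the (normalized) embedding of the input image.
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    img_feat = model.encode_image(image)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)

tokenizer = SimpleTokenizer()
sot = model.token_embedding(torch.tensor([[tokenizer.encoder["<|startoftext|>"]]], device=device))
eot = model.token_embedding(torch.tensor([[tokenizer.encoder["<|endoftext|>"]]], device=device))

n_tokens, width = 8, model.token_embedding.weight.shape[1]
soft = (0.02 * torch.randn(1, n_tokens, width, device=device)).requires_grad_(True)
optimizer = torch.optim.Adam([soft], lr=0.05)

def encode_soft_prompt(soft):
    """CLIP's text forward pass, but on continuous embeddings instead of token ids."""
    x = torch.cat([sot, soft, eot], dim=1)                        # [1, n+2, width]
    x = torch.cat([x, torch.zeros(1, 77 - x.shape[1], width, device=device)], dim=1)
    x = x + model.positional_embedding
    x = model.transformer(x.permute(1, 0, 2)).permute(1, 0, 2)    # NLD <-> LND
    x = model.ln_final(x)
    return x[:, n_tokens + 1, :] @ model.text_projection          # features at <eot>

for step in range(300):
    txt_feat = encode_soft_prompt(soft)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    loss = -(img_feat * txt_feat).sum()        # maximize cosine similarity
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# "CLIP opinion": snap each soft embedding to its nearest vocabulary token.
with torch.no_grad():
    ids = torch.cdist(soft[0], model.token_embedding.weight).argmin(dim=-1)
print(tokenizer.decode(ids.tolist()))
```

Snapping the optimized embeddings back to their nearest vocabulary entries is what turns the continuous optimization result into readable tokens; the heatmap half of the pipeline is then handled by runexplain.py on top of Transformer-MM-Explainability.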
Setup
- Install OpenAI/CLIP and hila-chefer/Transformer-MM-Explainability
- Put the contents of this repo into the "/Transformer-MM-Explainability" folder
- Execute "python runall.py" from the command line, follow instructions
- Or run the individual scripts separately, check runall.py for details
- You should have most requirements from the prerequisite installs (first step above), except maybe kornia ("pip install kornia")
- Requires a minimum of 4 GB VRAM (CLIP ViT-B/32). Check clipga.py and adjust the batch size if you get a CUDA OOM, or if you want to use a different model
- Use the same CLIP ViT model for clipga.py and runexplain.py (defined at the top of each script, "clipmodel=")... Or experiment around! Valid model names are listed in the snippet after this list.
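For reference, the model string set via "clipmodel=" has to be one of the names that OpenAI's clip package ships with; in the snippet below, only that variable name comes from this README, the rest is illustrative.

```python
# Valid model strings for clip.load(); pick one and use the SAME string in
# clipga.py and runexplain.py ("clipmodel=" at the top of each script).
import clip

print(clip.available_models())
# ['RN50', 'RN101', 'RN50x4', 'RN50x16', 'RN50x64',
#  'ViT-B/32', 'ViT-B/16', 'ViT-L/14', 'ViT-L/14@336px']

clipmodel = "ViT-B/32"                      # ~4 GB VRAM; larger ViTs need more
model, preprocess = clip.load(clipmodel)    # downloads the weights on first use
```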
What does a vision transformer "see"?
- Find out where CLIP's attention lands for a given image; explore the biases as well as the sophistication and broad concepts the AI has learned
- Use CLIP's "opinion" together with the heatmap overlay to verify what it refers to, then try prompting your favorite text-to-image AI with those tokens. YES! Even the "crazy tokens"; after all, it is a CLIP steering the image towards your prompt inside a text-to-image AI system! (A rough do-it-yourself saliency sketch follows below.)
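To get a feel for the heatmap idea without the full pipeline, here is a deliberately crude stand-in: plain input-gradient saliency of the CLIP image-text similarity. It is NOT the attention-relevance method this repo uses via Transformer-MM-Explainability / runexplain.py, just a few lines to show the [token] + [image] -> heatmap loop; the file name and the prompt are placeholders.

```python
# Crude stand-in for the heatmap step: input-gradient saliency of the CLIP
# image<->text similarity. NOT the attention-relevance method used by
# Transformer-MM-Explainability / runexplain.py -- an illustration only.
import torch
import clip
from PIL import Image
from torchvision.utils import save_image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
model = model.float().eval()

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
image.requires_grad_(True)
text = clip.tokenize(["a photo of a cat"]).to(device)   # e.g. a token from the "CLIP opinion"

img_feat = model.encode_image(image)
txt_feat = model.encode_text(text)
similarity = torch.cosine_similarity(img_feat, txt_feat)
similarity.backward()

# Pixel-wise saliency: how strongly each pixel influences the similarity score.
saliency = image.grad.abs().sum(dim=1, keepdim=True)          # [1, 1, 224, 224]
saliency = (saliency - saliency.min()) / (saliency.max() - saliency.min() + 1e-8)

# Save the normalized saliency map; overlay it on the image with a tool of your choice.
save_image(saliency, "saliency.png")
```

The real runexplain.py produces much cleaner maps by propagating relevance through the attention layers, which is exactly why the Transformer-MM-Explainability prerequisite is required.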
Examples: