
CLIP text-image interpretability

Visual Interpretability / XAI Tool for CLIP ViT (Vision Transformer) models

Credits & Prerequisites

Overview

In simple terms: an image is fed into a CLIP ViT (Vision Transformer) to obtain "a CLIP opinion" about it, i.e. words (text tokens) found via gradient ascent. The resulting [token] + [image] pair is then used to visualize what CLIP is "looking at" (attention visualization), producing a heatmap overlaid on the original image.
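
To make the gradient-ascent step concrete, here is a minimal, self-contained sketch: soft text-token embeddings are optimized so that their CLIP text encoding matches the image's CLIP embedding, then snapped to the nearest vocabulary tokens. This is an illustrative simplification, not the actual clipga.py code; "example.jpg", n_tokens, the step count and learning rate are placeholder choices.

```python
# Hedged sketch: gradient ascent on soft CLIP text-token embeddings.
import torch
import clip
from clip.simple_tokenizer import SimpleTokenizer
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
model = model.float()  # full precision keeps the ascent numerically stable

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    image_feat = model.encode_image(image)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)

n_tokens = 8                      # number of "opinion" tokens to optimize
embed = model.token_embedding     # vocab -> embedding table
ctx = model.context_length        # 77 for the standard CLIP text encoder
with torch.no_grad():
    sot = embed(torch.tensor([49406], device=device))   # <start_of_text>
    eot = embed(torch.tensor([49407], device=device))   # <end_of_text>
    pad = embed(torch.zeros(ctx - n_tokens - 2, dtype=torch.long, device=device))

soft = (torch.randn(n_tokens, embed.weight.shape[1], device=device) * 0.02).requires_grad_(True)
opt = torch.optim.Adam([soft], lr=0.05)

def text_features(soft_tokens):
    # Re-implements CLIP's encode_text(), but with learnable embeddings
    # in place of the middle tokens of the (padded) prompt.
    x = torch.cat([sot, soft_tokens, eot, pad], dim=0).unsqueeze(0)
    x = x + model.positional_embedding
    x = model.transformer(x.permute(1, 0, 2)).permute(1, 0, 2)
    x = model.ln_final(x)
    return x[:, n_tokens + 1] @ model.text_projection    # EOT position

for step in range(300):
    opt.zero_grad()
    txt = text_features(soft)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    loss = -(txt @ image_feat.T).mean()   # maximize cosine similarity
    loss.backward()
    opt.step()

# Snap each optimized embedding to its nearest vocabulary token and decode
with torch.no_grad():
    token_ids = (soft @ embed.weight.T).argmax(dim=-1).tolist()
print(SimpleTokenizer().decode(token_ids))
```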

Setup

  1. Install OpenAI/CLIP and hila-chefer/Transformer-MM-Explainability
  2. Put the contents of this repo into the "/Transformer-MM-Explainability" folder
  3. Execute "python runall.py" from the command line and follow the instructions
  4. Alternatively, run the individual scripts separately; check runall.py for details
  5. Most requirements are covered by the prerequisite installs from step 1, except possibly kornia ("pip install kornia")
  6. Requires a minimum of 4 GB VRAM (for CLIP ViT-B/32). If you get a CUDA OOM, or want to use a different model, check clipga.py and adjust the batch size
  7. Use the same CLIP ViT model for clipga.py and runexplain.py (defined at the top of each script via "clipmodel=", see the snippet below)... Or experiment around!
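
As a concrete example of steps 6 and 7, the settings look roughly like the following; apart from "clipmodel", the exact variable names are assumptions about clipga.py and runexplain.py, not quotes from them:

```python
# At the top of both clipga.py and runexplain.py; "batch_size" is a
# hypothetical stand-in for the actual batch-size setting in clipga.py.
clipmodel = "ViT-B/32"   # use the same model string in both scripts
batch_size = 8           # lower this in clipga.py if you hit a CUDA OOM
```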

What does a vision transformer "see"?
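
The repository's overlays are produced with hila-chefer's Transformer-MM-Explainability relevance propagation (via runexplain.py). As a rough, self-contained illustration of locating what CLIP is "looking at", here is a much simpler gradient-saliency sketch; it is not the repository's method, and "example.jpg" and the text prompt are placeholders:

```python
# Hedged sketch: a plain input-gradient (saliency) heatmap overlaid on the image.
import torch
import clip
import matplotlib.pyplot as plt
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
model = model.float()

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
image.requires_grad_(True)
text = clip.tokenize(["a photo of a dog"]).to(device)  # e.g. a token found by gradient ascent

# Similarity between the image and the text, then backprop to the input pixels
sim = torch.cosine_similarity(model.encode_image(image), model.encode_text(text)).sum()
sim.backward()

# Aggregate pixel gradients into a heatmap and overlay it on the (rescaled) image
grad = image.grad.abs().squeeze(0).mean(dim=0).cpu().numpy()
grad = (grad - grad.min()) / (grad.max() - grad.min() + 1e-8)
img_np = image.detach().squeeze(0).permute(1, 2, 0).cpu().numpy()
img_np = (img_np - img_np.min()) / (img_np.max() - img_np.min() + 1e-8)

plt.imshow(img_np)
plt.imshow(grad, cmap="jet", alpha=0.5)
plt.axis("off")
plt.savefig("heatmap_overlay.png", bbox_inches="tight")
```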

Examples:

(Example images from the repository: what-clip-sees, attention-guided, interoperthunderbirds.)