Hi @shams2023 - I am not sure what you are referring to. I only shared the link on social media, but I never wrote an "article" / e.g. blog post about it. Maybe somebody else did - can you give me some more details about what you are looking for? Do you need any specific information? Kind regards!
Hello author! I just want to know what task the code you uploaded is mainly meant to accomplish. Can it do the following: given an image-text pair whose per-modality feature representations are obtained with the CLIP model, can we visualize which part of the image the text actually attends to? Thank you for your reply. Wishing you success in your work and studies!
@shams2023 With attention visualization (similar to "GradCAM"), you would normally provide the class, such as "cat" or "a photo of a dog", as the text input together with the image input. You then get an attention heatmap for the features CLIP identifies in the image as most relevant to that class. So you might find that CLIP's attention is mostly on the eyes for "cat", but focused on a dog's ear as the most relevant feature for "dog".
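For illustration only, here is a minimal sketch of that "predefined class" approach, using plain input-gradient saliency as a simpler stand-in for GradCAM. This is not code from this repo; the openai/CLIP package and the file `cat.jpg` are assumptions:

```python
# Minimal sketch: saliency for a *user-provided* class, via input gradients.
# Assumes the openai/CLIP package (pip install git+https://github.com/openai/CLIP.git)
# and a local file "cat.jpg" - both are illustrative assumptions.
import torch
import clip
from PIL import Image

device = "cpu"  # keeps CLIP in float32, which is simpler for backprop
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("cat.jpg")).unsqueeze(0).to(device)
image.requires_grad_(True)                       # we want d(similarity)/d(pixels)
text = clip.tokenize(["a photo of a cat"]).to(device)

image_features = model.encode_image(image)
text_features = model.encode_text(text)
similarity = torch.cosine_similarity(image_features, text_features)
similarity.sum().backward()                      # gradients w.r.t. the input pixels

# Collapse the channel dimension -> one heatmap value per pixel, then normalize
saliency = image.grad.abs().amax(dim=1).squeeze(0)          # shape: (224, 224)
saliency = (saliency - saliency.min()) / (saliency.max() - saliency.min() + 1e-8)
print(saliency.shape)   # overlay this on the preprocessed image to visualize
```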
However, this basically "forces" CLIP to see a pre-defined class (text provided by a human user).
This code, however, first runs CLIP gradient ascent on the input image -> the output is text tokens / words, a prediction of the class(es) that CLIP itself "sees" in the image. It basically obtains CLIP's "opinion" about what the image contains, according to the features CLIP learned during training. For a photo of a cat, CLIP may well predict "cat" as one of the classes - but it also predicts further words / classes, some of which might seem unusual or nonsensical to humans, such as "domino", "stripe", "fanci" or "catmented". That is because CLIP picks up local patterns (it might see a "map" in a cat's fur pattern, too), as well as background patterns that humans might not be paying attention to.
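To make the difference concrete, here is a heavily simplified, illustrative sketch of gradient ascent towards text - not the code in this repo - where soft token embeddings are optimized to match the image features and then decoded to the nearest vocabulary words. `cat.jpg`, the number of optimization steps and the fixed EOT position are all assumptions:

```python
# Toy sketch of "gradient ascent towards text": optimize soft token embeddings so
# that the resulting text features match the image features, then read off the
# nearest vocabulary tokens. NOT the code from this repo - just an illustration.
import torch
import clip
from PIL import Image
from clip.simple_tokenizer import SimpleTokenizer

device = "cpu"  # keeps the model in float32
model, preprocess = clip.load("ViT-B/32", device=device)
for p in model.parameters():
    p.requires_grad_(False)        # we only optimize the soft text embeddings
tokenizer = SimpleTokenizer()

image = preprocess(Image.open("cat.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    img_feat = model.encode_image(image)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)

n_ctx, d_model = 77, model.token_embedding.weight.shape[1]
soft_emb = (torch.randn(1, n_ctx, d_model, device=device) * 0.01).requires_grad_(True)
eot_pos = 8  # assumption: positions 1..7 are free tokens, position 8 acts as EOT
optimizer = torch.optim.Adam([soft_emb], lr=0.1)

def text_features_from_embeddings(emb):
    # same pipeline as model.encode_text, but starting from embeddings, not token ids
    x = emb + model.positional_embedding
    x = x.permute(1, 0, 2)          # NLD -> LND
    x = model.transformer(x)
    x = x.permute(1, 0, 2)          # LND -> NLD
    x = model.ln_final(x)
    return x[:, eot_pos] @ model.text_projection

for step in range(200):
    optimizer.zero_grad()
    txt_feat = text_features_from_embeddings(soft_emb)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    loss = -(img_feat * txt_feat).sum()   # maximize cosine similarity with the image
    loss.backward()
    optimizer.step()

# Decode CLIP's "opinion": nearest real vocabulary token for every optimized position
vocab = model.token_embedding.weight                               # [vocab_size, d_model]
ids = torch.cdist(soft_emb[0, 1:eot_pos].detach(), vocab).argmin(dim=-1)
print(tokenizer.decode(ids.tolist()))
```

A practical implementation would typically add regularization and smarter decoding, but the core idea is the same: the text comes out of CLIP itself rather than being supplied by the user.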
It also reveals what is known as CLIP's typographic attack vulnerability, the model's general tendency to overfit on text in the image (although it recognized both "apple" and "ipod" in the famous example image from OpenAI's blog - as well as "fakepods"). This results in a very different attention heatmap, which shows very clearly how much CLIP focuses on the text - especially when you compare it to the attention after the text has been removed (photoshopped).
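As a quick, hedged illustration of the typographic effect itself (again a sketch, not this repo's code), you can compare CLIP's zero-shot probabilities for the same object with and without the pasted-on label; the file names below are placeholders:

```python
# Compare CLIP's zero-shot predictions for a plain apple photo vs. the same photo
# with an "iPod" label in it (typographic attack). File names are placeholders.
import torch
import clip
from PIL import Image

device = "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
labels = ["an apple", "an iPod"]
text = clip.tokenize(labels).to(device)

for path in ["apple_plain.jpg", "apple_with_ipod_label.jpg"]:
    image = preprocess(Image.open(path)).unsqueeze(0).to(device)
    with torch.no_grad():
        logits_per_image, _ = model(image, text)           # standard CLIP forward pass
        probs = logits_per_image.softmax(dim=-1).squeeze(0)
    print(path, {label: round(p.item(), 3) for label, p in zip(labels, probs)})
```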
So, to summarize:

- We do not tell CLIP what it should see in the image (by defining the text / class it should "look for"); instead, we first get CLIP's own "opinion" (text prediction) of what the image contains - i.e. the features and associated tokens / classes CLIP primarily "sees".
- We then use CLIP's own predicted words / "opinion" about the salient features to create an attention heatmap for the resulting [text-class]-[image] pairs.
I hope this helps! 👋
Thank you very much for your help. Thank you again, thank you!
Hello author, may I ask in which article you mentioned this visualization operation? I really need it, thank you!