Closed: ericyinyzy closed this issue 1 year ago
Regarding point 2, I had the same doubts. Did the authors push the correct version of the code?
Thank you for your interest in our work. Existing open-source large VLMs are primarily composed of publicly available modules (e.g., CLIP and Vicuna). This increases the chance that the surrogate model used by a black-box adversary shares mutual information with the victim model, essentially making the victim model vulnerable to adversarial transferability.
Regarding your question, the visual encoder of BLIP-2 is actually a pre-trained CLIP ViT-L/14 encoder (see page 4 of the BLIP-2 paper). Nevertheless, even though the visual encoder contains CLIP, the CLIP module occupies only a small portion of the large VLM (e.g., a CLIP encoder has ~300M parameters, while an LLM like Vicuna may have ~13B). Most of the model capacity lies in the LLM, which is unseen by our transfer-based attackers, so it is non-trivial to fool large VLMs into returning targeted responses solely through adversarial transferability.
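As a rough sanity check on those parameter counts, you can compare the two modules directly (a minimal sketch assuming lavis's `Blip2OPT` attribute names such as `visual_encoder` and `opt_model`; this is not part of our released code):

```python
import torch
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"
# load BLIP-2 (OPT-2.7B variant) through the lavis package
model, _, _ = load_model_and_preprocess(
    name="blip2_opt", model_type="pretrain_opt2.7b", is_eval=True, device=device
)

def count(m):
    return sum(p.numel() for p in m.parameters())

print(f"visual encoder:     {count(model.visual_encoder) / 1e6:.0f}M parameters")
print(f"OPT language model: {count(model.opt_model) / 1e6:.0f}M parameters")
```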
The `tgt_image_features` come from the image encoder (the ViT), and I will update a note here.
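Concretely, something like the following should match what `forward_encoder_image` was meant to compute (a minimal sketch assuming lavis's `Blip2OPT` attributes `visual_encoder` and `ln_vision`; not the exact released code):

```python
import torch

# Sketch of the missing helper: ViT patch-token features *before* the Q-Former.
# Output shape is (B, 257, C) for a 224x224 input (256 patches + 1 [CLS] token).
def forward_encoder_image(blip_model, image):
    return blip_model.ln_vision(blip_model.visual_encoder(image))

# The fixed target features do not need gradients:
# with torch.no_grad():
#     tgt_image_features = forward_encoder_image(blip_model, tgt_image)
```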
Thanks
Hi @ericyinyzy, were you able to solve the first question regarding `forward_encoder_image`? Thank you
Thank you for releasing the codes and providing an in-depth analysis in the paper. I have the following two questions when reproducing the attack codes on the model `blip2` in `LAVIS_tool`.

1. In `LAVIS_tool/_train_adv_img_blip.py`, the target image features come from a call to `blip_model.forward_encoder_image`, where `blip_model` is loaded through the `lavis` package. However, there seems to be no implementation of a `forward_encoder_image` function in the source code of `lavis`, which triggers the error: `AttributeError: 'Blip2OPT' object has no attribute 'forward_encoder_image'`. May I ask where the `tgt_image_features` come from? Do they come directly from the image encoder, i.e. the ViT
(257 tokens), or from the output of the Q-Former in BLIP-2 (32 tokens)? For context, a sketch of the loop I am trying to run is included at the end of this post.

2. Since the surrogate model for the transfer attack is CLIP, and BLIP-2's visual encoder appears to be a pre-trained CLIP ViT itself, doesn't the surrogate effectively share parameters with the victim model? If so, can the attack on BLIP-2 still be considered fully black-box?

Thank you again for releasing the codes! I hope to hear your thoughts on the above questions.
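Here is a minimal sketch of the optimization loop I am trying to run (my own reconstruction, not the repository's exact code): `encode_image` stands in for the missing `forward_encoder_image`, images are assumed to be in [0, 1] with normalization omitted, and `epsilon`, `alpha`, and `num_steps` are placeholder values.

```python
import torch
import torch.nn.functional as F

epsilon, alpha, num_steps = 8 / 255, 1 / 255, 300  # placeholder hyperparameters

# cle_image: batch of clean images in [0, 1]; tgt_image_features: precomputed
# features of the target image (the tensor this issue is asking about)
delta = torch.zeros_like(cle_image, requires_grad=True)
for _ in range(num_steps):
    adv_image = (cle_image + delta).clamp(0, 1)
    adv_features = encode_image(blip_model, adv_image)
    # push the adversarial image's features toward the target image's features
    loss = F.cosine_similarity(
        adv_features.flatten(1), tgt_image_features.flatten(1)
    ).mean()
    loss.backward()
    # PGD-style L-inf step on the perturbation
    delta.data = (delta + alpha * delta.grad.sign()).clamp(-epsilon, epsilon)
    delta.grad.zero_()
```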