yunqing-me / AttackVLM

[NeurIPS-2023] Annual Conference on Neural Information Processing Systems
https://arxiv.org/pdf/2305.16934.pdf
MIT License

Questions about attack on BLIP (LAVIS) #7

Closed · ericyinyzy closed this issue 1 year ago

ericyinyzy commented 1 year ago

Thank you for releasing the codes and providing an in-depth analysis in the paper.

I ran into the following two questions while reproducing the attack code on the BLIP-2 model in LAVIS_tool.

  1. For the transfer attack implemented in the script LAVIS_tool/_train_adv_img_blip.py, the target image features come from a call to blip_model.forward_encoder_image (shown in a screenshot in the original issue), where blip_model is loaded through the lavis package. However, there appears to be no implementation of forward_encoder_image in the lavis source code, which triggers the error AttributeError: 'Blip2OPT' object has no attribute 'forward_encoder_image'. May I ask where the tgt_image_features come from? Do they come directly from the image encoder, i.e. the ViT (257 tokens), or from the output of the Q-Former in BLIP-2 (32 tokens)?

  2. Additionally, based on the code, the surrogate image encoder used to generate the adversarial examples appears to be the same as the image encoder of the victim model BLIP-2. Wouldn't using the same image encoder undermine the black-box attack setting described in the paper? (Maybe I misunderstood this part.)

Thank you again for releasing the code! I hope to hear your thoughts on the above questions.

Yuancheng-Xu commented 1 year ago

Regarding point 2, I had the same doubts. Did the authors push the correct version of the code?

yunqing-me commented 1 year ago

Thank you for your interest in our work. Existing open-source large VLMs are primarily composed of publicly available modules (e.g., CLIP and Vicuna). This increases the chance that the surrogate model used by black-box adversaries shares mutual information with the victim model, which essentially makes the victim model vulnerable to adversarial transferability.

Regarding your question, the visual encoder of BLIP-2 is actually a pre-trained CLIP ViT-L/14 encoder (see page 4 of the BLIP-2 paper). Nevertheless, even though the visual encoder contains CLIP, the CLIP module occupies only a small portion of a large VLM (e.g., a CLIP encoder has ~300M parameters while an LLM like Vicuna may have ~13B), so most of the model capacity sits in the LLM, which is unseen by our transfer-based attackers, and it is non-trivial to fool large VLMs into returning targeted responses solely through adversarial transferability.
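
To make the transfer setting concrete, here is a minimal sketch of a PGD-style feature-matching loop that pushes the surrogate encoder's features of the adversarial image toward the target image features. This illustrates the general mechanism rather than reproducing the released script; `surrogate_encoder`, `x`, `tgt_features`, and the hyperparameters below are placeholders:

```python
import torch
import torch.nn.functional as F

def feature_matching_attack(surrogate_encoder, x, tgt_features,
                            eps=8 / 255, alpha=1 / 255, steps=100):
    """PGD-style transfer attack sketch: maximize the cosine similarity
    between the surrogate's embedding of the adversarial image and the
    embedding of the target image, under an L_inf budget of eps."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        adv_features = surrogate_encoder(x_adv)  # placeholder encoder
        # Per-sample cosine similarity between flattened feature maps.
        loss = F.cosine_similarity(adv_features.flatten(1),
                                   tgt_features.flatten(1)).mean()
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()       # gradient ascent step
            x_adv = x + (x_adv - x).clamp(-eps, eps)  # project to L_inf ball
            x_adv = x_adv.clamp(0, 1).detach()        # stay in valid pixel range
    return x_adv
```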

The tgt_image_features come from the ViT image encoder, and I will add a note on this here.
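
For anyone hitting the AttributeError above, here is a minimal sketch of pulling the ViT features out of a LAVIS BLIP-2 model, assuming the visual_encoder and ln_vision attributes that LAVIS's Blip2Base exposes (the model type and image path below are illustrative):

```python
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"
blip_model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_opt", model_type="pretrain_opt2.7b", is_eval=True, device=device
)

# Hypothetical target image; preprocess it the same way as during the attack.
raw = Image.open("target.png").convert("RGB")
tgt_image = vis_processors["eval"](raw).unsqueeze(0).to(device)

# Blip2OPT has no forward_encoder_image; the ViT features can instead be
# taken straight from the vision tower. With a 224x224 input and patch
# size 14 this yields 257 tokens (1 [CLS] + 16x16 patches).
with torch.no_grad():
    tgt_image_features = blip_model.ln_vision(blip_model.visual_encoder(tgt_image))
```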

Thanks

kz29 commented 8 months ago

Hi @ericyinyzy, were you able to solve the first question regarding forward_encoder_image? Thank you