Thank you for your interest in our work. Existing open-source large VLMs are primarily composed of publicly available modules (e.g., CLIP and Vicuna). This increases the chance that the surrogate model used by a black-box adversary shares mutual information with the victim model, which in turn makes the victim model vulnerable to adversarial transferability.
Back to your question: the visual encoder of BLIP is also a pre-trained CLIP encoder (see page 4 of the paper). Nevertheless, even though the visual encoder contains CLIP, the CLIP module occupies only a small portion of a large VLM (e.g., a CLIP encoder has ~300M parameters, while an LLM like Vicuna has ~13B parameters). Most of the model capacity resides in the LLM, which is unseen to our transfer-based attackers, so it is non-trivial to fool large VLMs into returning targeted responses solely via adversarial transferability.
We used the pretrained CLIP. In our experience, using an entirely different visual encoder (e.g., another variant of CLIP, or an adversarially trained CLIP) can degrade the attack success rate (ASR), but there are still successful cases of crafting adversarial samples.
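For concreteness, here is a minimal sketch of what a transfer-based targeted attack on a CLIP surrogate could look like. This is not the authors' released code: the function name `pgd_targeted`, the choice of the HuggingFace `openai/clip-vit-large-patch14` checkpoint, and the PGD hyperparameters are all illustrative assumptions. The idea is simply to perturb the image so its CLIP embedding moves toward the embedding of a targeted response, hoping the perturbation transfers to a victim VLM built on CLIP.

```python
# Illustrative sketch (not the paper's implementation) of a targeted
# PGD attack on a CLIP surrogate, using torch + HuggingFace transformers.
import torch
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").to(device).eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def pgd_targeted(clean_pixels, target_text, eps=8 / 255, alpha=1 / 255, steps=100):
    """Perturb `clean_pixels` (preprocessed image tensor) so its CLIP image
    embedding aligns with the embedding of `target_text`. Hyperparameters
    are placeholder values, not the paper's settings."""
    text_inputs = processor(text=[target_text], return_tensors="pt").to(device)
    with torch.no_grad():
        target_emb = model.get_text_features(**text_inputs)
        target_emb = target_emb / target_emb.norm(dim=-1, keepdim=True)

    delta = torch.zeros_like(clean_pixels, requires_grad=True)
    for _ in range(steps):
        img_emb = model.get_image_features(pixel_values=clean_pixels + delta)
        img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
        # Maximize cosine similarity to the targeted embedding.
        loss = (img_emb * target_emb).sum()
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()
            delta.clamp_(-eps, eps)  # stay inside the L-inf ball
            delta.grad.zero_()
    return (clean_pixels + delta).detach()
```

Note that this only attacks the ~300M-parameter CLIP encoder; whether the resulting image actually elicits the targeted response from a full VLM depends on how much of the victim's behavior is mediated by that shared module.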
Should I classify your attack method as a gray-box attack? Your paper rarely mentions how the surrogate model is chosen. I think the surrogate model should not be part of the target model if your attack is black-box. However, you used the BLIP encoder as the surrogate model to attack BLIP, and you used the CLIP encoder (a component of unidiff) as the surrogate model to attack unidiff.