yunqing-me / AttackVLM

[NeurIPS-2023] Annual Conference on Neural Information Processing Systems
https://arxiv.org/pdf/2305.16934.pdf
MIT License

Is your method black-box or gray-box? #13

Closed 1057939502 closed 7 months ago

1057939502 commented 11 months ago

Should I classify your attack method as a gray-box attack? In your paper, you rarely mention how to choose the surrogate model. I think the surrogate model should not be part of the target model if your attack is black-box. However, you used the BLIP encoder as the surrogate model to attack BLIP, and you used the CLIP encoder (a component of unidiff) as the surrogate model to attack unidiff.

yunqing-me commented 7 months ago

Thank you for your interest in our work. Existing open-source large VLMs are primarily composed of publicly available modules (e.g., CLIP and Vicuna). This increases the chance that the surrogate model used by a black-box adversary shares mutual information with the victim model, which in turn makes the victim model vulnerable to adversarial transferability.

Back to your question: the visual encoder of BLIP is also a pre-trained CLIP encoder (see page 4 of the paper). Nevertheless, even though the visual encoder contains CLIP, the CLIP module occupies only a small portion of the large VLM (e.g., a CLIP encoder has ~300M parameters while an LLM like Vicuna may have ~13B). Most of the model capacity lies in the LLM, which is unseen to our transfer-based attacker, and it is non-trivial to fool large VLMs into returning targeted responses based solely on adversarial transferability.

We used the pretrained CLIP. In our experience, if you use an entirely different visual encoder (e.g., another variant of CLIP, or an adversarially trained CLIP), the ASR may degrade, but there are still successful cases of crafting the adversarial samples.
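
For concreteness, here is a minimal sketch of the transfer-based setup discussed above, assuming OpenAI's `clip` package as the surrogate encoder and a simple PGD-style feature-matching objective. The file names, epsilon budget, and step counts are illustrative placeholders, and this is not the repo's actual attack pipeline; the adversarial image is crafted against the surrogate encoder only and then handed to the (unseen) victim VLM.

```python
# Sketch: craft an adversarial image whose surrogate-CLIP embedding matches a
# target image's embedding, then transfer it to a victim VLM (assumptions noted above).
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git (assumed surrogate)
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # surrogate visual encoder
model.eval()
model.requires_grad_(False)  # only the perturbation is optimized

clean = preprocess(Image.open("clean.png")).unsqueeze(0).to(device)    # placeholder path
target = preprocess(Image.open("target.png")).unsqueeze(0).to(device)  # placeholder path

# Illustrative hyperparameters; the budget here is applied in the preprocessed
# (normalized) space for simplicity, not strict pixel-space l_inf.
eps, alpha, steps = 8 / 255, 1 / 255, 100
delta = torch.zeros_like(clean, requires_grad=True)

with torch.no_grad():
    target_feat = model.encode_image(target)
    target_feat = target_feat / target_feat.norm(dim=-1, keepdim=True)

for _ in range(steps):
    adv_feat = model.encode_image(clean + delta)
    adv_feat = adv_feat / adv_feat.norm(dim=-1, keepdim=True)
    # Maximize cosine similarity between adversarial and target embeddings.
    loss = (adv_feat * target_feat).sum()
    loss.backward()
    with torch.no_grad():
        delta += alpha * delta.grad.sign()
        delta.clamp_(-eps, eps)
        delta.grad.zero_()

adv_image = (clean + delta).detach()  # query the victim VLM with this image
```

The key point of the discussion above is reflected in the sketch: only the surrogate encoder is queried during optimization, and the victim model (including its LLM component) is never accessed, so any success on the victim rests on adversarial transferability.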