xsc1234 / Invisible-Relevance-Bias

This is the repository for "AI-Generated Images Introduce Invisible Relevance Bias to Text-Image Retrieval"

Questions regarding the experimental data of the paper #3

Open oceanzhf opened 1 month ago

oceanzhf commented 1 month ago

Could you explain how you concluded that the difference between the text similarity of AI-generated images and that of real images increases when AI-generated and real images are mixed for retrieval? When I compute the similarity between AI-generated images and text, and between real images and text, I obtain the same results whether the two sets are scored separately or mixed together.
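For concreteness, this is roughly how I computed the similarities (a minimal sketch using a CLIP-style dual encoder from Hugging Face; the paper also evaluates FLAVA and other models, and the caption and image paths below are made up):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustration with a CLIP-style dual encoder; not necessarily the exact models/checkpoints in the repo.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def encode(texts, image_paths):
    """Return L2-normalized text and image embeddings."""
    images = [Image.open(p).convert("RGB") for p in image_paths]
    with torch.no_grad():
        t = model.get_text_features(**processor(text=texts, return_tensors="pt", padding=True))
        v = model.get_image_features(**processor(images=images, return_tensors="pt"))
    return t / t.norm(dim=-1, keepdim=True), v / v.norm(dim=-1, keepdim=True)

# The similarity of a (text, image) pair is just their dot product, so it does not depend on
# which other images happen to be in the corpus -- only the ranking can change.
texts = ["a dog playing in the snow"]                 # hypothetical caption
t, v_real = encode(texts, ["real/dog.jpg"])           # hypothetical paths
_, v_gen = encode(texts, ["generated/dog.jpg"])
print((t @ v_real.T).item(), (t @ v_gen.T).item())
```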

xsc1234 commented 1 month ago

Please refer to our paper: we said "as the mixing ratio of AI-generated images in training data increases", not just mixing.

oceanzhf commented 1 month ago

Thank you for your response! In Section 2.2.1, regarding the statement "considering the excellent zero-shot text-image retrieval performance of the three models, we use these models directly for retrieval in zero-shot setting": can I understand this as meaning that you directly used the pre-trained models to perform retrieval on datasets containing only generated images and only real images, without any training or fine-tuning? In Table 2, the N@k values for generated and real images are actually close (taking FLAVA as an example). In Section 3.2, you mentioned, "For the models that have been pre-trained on massive real text-image pairs and show excellent zero-shot performance in text-image retrieval, we directly use these pre-trained models to perform retrieval on the test datasets." Does this refer to using FLAVA to retrieve from a mixed corpus of AI-generated and real images? It seems that these mixed images were not used as training samples, correct?

I noticed that in Table 4, when retrieval is performed on the mixed set of real and AI-generated images, the N@k gap between the generated and real images becomes quite large, which differs from the results obtained when retrieving over them separately in Table 2. This brings me to a question: how was this mixed N@k obtained? When I used your code to retrieve over the real and generated images separately and mixed, I found that their similarities (the dot products of the text and image encodings) were the same. In your code, do the ndcg@1 and ndcg_g@1 metrics refer to the results in Table 2 or Table 4?
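To make my question concrete, this is how I imagine the mixed-corpus evaluation might work (purely my guess, not your code; the function name is hypothetical, and NDCG@1 with a single relevant item reduces to checking the top-1 hit):

```python
import torch

def ndcg_at_1_mixed(text_emb, real_emb, gen_emb):
    """Hypothetical mixed-corpus evaluation: for query i, rank the union of all real and
    generated images; the i-th real and the i-th generated image are the matching candidates."""
    corpus = torch.cat([real_emb, gen_emb], dim=0)   # mixed corpus: real images first, then generated
    scores = text_emb @ corpus.T                     # (num_queries, 2 * num_images)
    top1 = scores.argmax(dim=1)
    n = real_emb.size(0)
    idx = torch.arange(n)
    ndcg_real = (top1 == idx).float().mean().item()       # matching real image ranked first
    ndcg_gen = (top1 == idx + n).float().mean().item()     # matching generated image ranked first
    return ndcg_real, ndcg_gen
```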

I’m looking forward to your response, as it’s very important to me. Thank you!

xsc1234 commented 1 month ago

You should not associate Table 2 with the subsequent sections. As Section 2 explains, Table 2 only aims to assess the quality of the generated images, not to evaluate the bias. Retrieval performance on a corpus containing only generated images should not change significantly compared with retrieval on real images only. This ensures that the distinguishability of the generated images is consistent with that of the real images and that no additional visual semantics relevant (or irrelevant) to the query are introduced by image generation.

Table 4 is the main table: it shows that the retrieval models prefer AI-generated images on the mixed corpus containing both real and AI-generated images (the Relative△ metric).

As for your question about "as the mixing ratio of AI-generated images in training data increases": it is in Figures 2 and 3 in Section 3.4 that we add AI-generated images to the training data.
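In simplified form, the Section 3.4 setting looks like the following sketch (illustrative only, not our exact training script; `triples` and the function name are hypothetical):

```python
import random

def build_training_set(triples, mixing_ratio, seed=0):
    """Replace a fraction `mixing_ratio` of the real images in (caption, real_image,
    generated_image) training triples with their AI-generated counterparts."""
    rng = random.Random(seed)
    mixed = []
    for caption, real_img, gen_img in triples:
        use_generated = rng.random() < mixing_ratio
        mixed.append((caption, gen_img if use_generated else real_img))
    return mixed

# The retriever is then fine-tuned with its usual contrastive loss on `mixed`,
# and the bias is measured as `mixing_ratio` increases (Figures 2 and 3).
```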

xsc1234 commented 1 month ago

Please read Section 2 carefully: Table 2 aims to "indicate that the AI-generated images in our benchmark do not introduce more visual semantics relevant to the queries." For a fair assessment, a reasonable scenario for evaluating the potential bias requires that the generated images and the real images have sufficiently similar visual semantics. This avoids the image generation process adding or removing semantic associations between the generated images and the queries. That is, an IR model preferring (or rejecting) an AI-generated image that is more (or less) semantically relevant to the query than the real image cannot prove the existence (or nonexistence) of the bias.
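In other words, the Table 2 check consists of two independent single-corpus retrieval runs, roughly like the sketch below (illustrative only, NDCG@1 with one relevant image per query; the paper reports several cutoffs, and the function name is hypothetical):

```python
import torch

def ndcg_at_1_single_corpus(text_emb, image_emb):
    """Zero-shot retrieval over a corpus containing ONLY real or ONLY generated images:
    query i is relevant to image i, so NDCG@1 is 1 when image i is ranked first."""
    scores = text_emb @ image_emb.T                  # (num_queries, num_images)
    top1 = scores.argmax(dim=1)
    return (top1 == torch.arange(image_emb.size(0))).float().mean().item()

# Table 2 compares these two numbers; if they are close, generation has not added or
# removed query-relevant visual semantics, so the mixed-corpus comparison is fair.
# ndcg_real = ndcg_at_1_single_corpus(text_emb, real_emb)
# ndcg_gen  = ndcg_at_1_single_corpus(text_emb, gen_emb)
```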

oceanzhf commented 1 month ago

Are the real images and generated images used in Table 2 not the same as those used in Table 4? Why is there such a large gap in N@?

oceanzhf commented 1 month ago

Thank you for your response! However, I am still confused. When using real images and AI-generated images separately, why is the N@k you obtained so different from the one in Table 4 for FLAVA? In Table 2 the N@k for real images is sometimes higher than that for AI-generated images, yet in Table 4 the N@k for real images is much lower.

So my question is: in the code you provided, do the printed ndcg@ and ndcg_g@ values correspond to the separate AI/real corpora, as in Table 2, or to the mixed corpus of real and AI images, as in Table 4? If the former, how were the mixed ndcg@ and ndcg_g@ values for AI and real images in Table 4 obtained? If the latter, how were the separate values in Table 2 obtained? This would be very helpful to me.

xsc1234 commented 1 month ago

Table 2 is for corpora containing only real or only AI-generated images, while Table 4 is for the corpus with mixed images, so N@k differs between them.
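A toy single-query example (made-up embeddings and scores, not numbers from the paper) shows why identical pairwise similarities still give different N@k in the two settings:

```python
import torch

query    = torch.tensor([[1.0, 0.0]])    # one toy query embedding
real     = torch.tensor([[0.90, 0.10]])  # its real image
gen      = torch.tensor([[0.95, 0.05]])  # its AI-generated counterpart (slightly higher score)
distract = torch.tensor([[0.10, 0.99]])  # an unrelated image

# Separate corpora (Table 2 setting): the real and the generated image each compete only
# with the distractor, so both runs rank the correct image first -> N@1 = 1 in both.
print((query @ torch.cat([real, distract]).T).argmax().item())  # 0 -> real image wins
print((query @ torch.cat([gen, distract]).T).argmax().item())   # 0 -> generated image wins

# Mixed corpus (Table 4 setting): the generated twin now outscores the real image, pushing it
# to rank 2, so N@k for real images drops while N@k for generated images stays high.
print((query @ torch.cat([real, gen, distract]).T).argmax().item())  # 1 -> generated image wins
```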