Hi @yuleiniu, thank you for your great work! I have two quick questions:
It seems to me that the core idea is very similar to Tang's unbiased SGG (CVPR'20): both works aim to remove the bad co-occurrence bias by subtracting the prediction obtained with certain inputs blocked out (the image modality / image patches). Am I misunderstanding something here?
On the discussion of "good" and "bad" biases: the proposed method seems to remove the "bad" language bias, but it also appears to remove the "good" ones. Beyond the Introduction, I could not find a detailed discussion of this central motivation (removing the bad biases while retaining the good ones), nor experimental evidence for it, in the paper. How do the good biases remain? Could you please elaborate on this?
Hi @coldmanck, thanks for your interest in our work!
We do share some common ideas in mitigating dataset bias, such as analyzing the causal relations and formulating the bias as causal effects. Reading both papers would help in understanding the idea.
CF-VQA preserves the "good" language context, in contrast to language-prior-based (or ensemble-based) methods like RUBi and Learned-Mixin. We argue that these methods treat the language bias and the language context as a whole, i.e., as the language prior, and simply remove the whole prior by dropping the QA branch at test time. In contrast, we keep the QA branch at test time so that it continues to provide the language context (see Figure 4). A qualitative comparison with RUBi is given in Section 5.2, showing that our method maintains the "good" context where RUBi does not. I agree that it is difficult to perfectly disentangle the "good" from the "bad"; compared to previous works, we take one step forward in this direction.
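To sketch the subtraction a bit more concretely, here is the standard counterfactual formulation in simplified notation (a rough sketch, not the paper's exact fusion functions; $Z$ denotes the model's score under a given input configuration, $q$, $k$, $v$ denote the question, multimodal knowledge, and vision inputs, and starred values denote the blocked-out counterfactual inputs):

```latex
% Rough sketch of counterfactual inference (simplified notation):
% TE  = total effect, NDE = natural direct effect of the question,
% TIE = total indirect effect used for debiased inference.
\begin{aligned}
\mathrm{TE}  &= Z_{q,k,v} - Z_{q^{*},k^{*},v^{*}} \\
\mathrm{NDE} &= Z_{q,k^{*},v^{*}} - Z_{q^{*},k^{*},v^{*}} \\
\mathrm{TIE} &= \mathrm{TE} - \mathrm{NDE} = Z_{q,k,v} - Z_{q,k^{*},v^{*}}
\end{aligned}
```

Inference uses TIE: only the question-only direct effect $Z_{q,k^{*},v^{*}}$ is subtracted, while the QA branch still contributes to the fused score $Z_{q,k,v}$, which is how the "good" language context is retained.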