zhuang-li / FactualSceneGraph

The FACTUAL benchmark dataset and the pre-trained textual scene graph parser trained on FACTUAL.
https://arxiv.org/pdf/2305.17497.pdf

Maybe some hallucination in output? #5

Closed tiesanguaixia closed 5 months ago

tiesanguaixia commented 5 months ago

Thank you for the great work! I tried some complex sentences and found some hallucinations in the output. For example, "phone" in the parse below does not appear in the original sentence. What can I do to reduce the hallucination?

from factual_scene_graph.parser.scene_graph_parser import SceneGraphParser
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'

parser = SceneGraphParser('./flan-t5-base-VG-factual-sg', device=device)

caption = 'A person climbing down a rock edge while someone talks about donating to a cause.'
text_graph = parser.parse([caption], beam_size=3, return_text=True)
print(text_graph)

# ['( person , climb down , rock ) , ( person , talk on , phone )']
zhuang-li commented 5 months ago

I think the hallucination stems more from a lack of contextual information, which I mentioned in the original paper. Take the second part of the sentence, 'someone talks about donating to a cause.' A dependency parser like the Stanford parser might annotate it as 'someone, talks about, donating to a cause', but it should be annotated as 'someone, is, talking', since 'donating to a cause' is obviously not an object. In the training set, however, most examples containing the verb 'talk' follow the format (subject, verb + preposition, object), so the parser learns a spurious correlation. It has also learned that 'donating to a cause' is not an object, so it infers a non-existent one ('phone') instead.

In summary, a potential solution could involve incorporating image information to train a multimodal parser. As I am not a computer vision expert, I did not undertake this work and leave it for future research. My dataset includes the image region ids from Visual Genome, so training a multimodal parser with a multimodal model like BLIP should be feasible, especially given the increasing availability of such models. If it works, it could be one of the contributions of your paper.
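
For anyone who wants to explore this direction, a very rough sketch of such a setup might look like the following. It is not part of the released code; it assumes a BLIP-2 (Flan-T5) checkpoint from Hugging Face transformers and a toy image/caption/graph triple with a made-up file name.

from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Rough sketch only, not part of this repo: fine-tune a BLIP-2 (Flan-T5) model to map
# an image region plus its caption to a FACTUAL-style graph string.
processor = Blip2Processor.from_pretrained('Salesforce/blip2-flan-t5-xl')
model = Blip2ForConditionalGeneration.from_pretrained('Salesforce/blip2-flan-t5-xl')

image = Image.open('vg_region.jpg')        # hypothetical Visual Genome region crop
caption = 'A person climbing down a rock edge.'
graph = '( person , climb down , rock )'   # gold FACTUAL-style target annotation

inputs = processor(images=image, text='Parse: ' + caption, return_tensors='pt')
labels = processor.tokenizer(graph, return_tensors='pt').input_ids

outputs = model(**inputs, labels=labels)   # seq2seq loss over the graph tokens
outputs.loss.backward()                    # an optimizer step would follow in real training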

zhuang-li commented 5 months ago

Another solution is to use ChatGPT with in-context learning examples retrieved from my dataset using some similarity metric. The significantly larger model may have more commonsense knowledge and could reduce this kind of problem.
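
A rough sketch of that idea might look like the following. The retrieval here is just token overlap, the (caption, graph) demonstrations are made-up placeholders rather than real FACTUAL entries, and it assumes the openai>=1.0 Python client with an OPENAI_API_KEY set in the environment.

from openai import OpenAI

# Hypothetical demonstrations; in practice, retrieve these from the FACTUAL dataset.
factual_examples = [
    ('A man is talking on the phone.', '( man , talk on , phone )'),
    ('A woman talking to a crowd.', '( woman , talk to , crowd )'),
]

def similarity(a, b):
    """Crude token-overlap similarity; swap in sentence embeddings for better retrieval."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

def parse_with_chatgpt(caption, k=2, model='gpt-3.5-turbo'):
    # Pick the k most similar FACTUAL captions as in-context examples.
    demos = sorted(factual_examples, key=lambda ex: similarity(caption, ex[0]), reverse=True)[:k]
    prompt = 'Turn each caption into scene graph triplets. Use only concepts grounded in the caption.\n\n'
    for demo_caption, demo_graph in demos:
        prompt += f'Caption: {demo_caption}\nGraph: {demo_graph}\n\n'
    prompt += f'Caption: {caption}\nGraph:'
    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=[{'role': 'user', 'content': prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

print(parse_with_chatgpt('A person climbing down a rock edge while someone talks about donating to a cause.'))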

tiesanguaixia commented 5 months ago

Thank you very much for the detailed explanation! I will check it later. By the way, during the training of your model, did you use text captions from multimodal datasets like MSR-VTT, MSVD, VATEX, and MS-COCO?

zhuang-li commented 5 months ago

I only used Visual Genome, but as I recall, VG includes some images and captions from MS-COCO.