zjunlp / HVPNeT

[NAACL 2022 Findings] Good Visual Guidance Makes A Better Extractor: Hierarchical Visual Prefix for Multimodal Entity and Relation Extraction
MIT License

How exactly is your query constructed to get the image regions #21

Closed AdrewTomas closed 1 month ago

AdrewTomas commented 2 months ago

Hi, I see that several issues mention "the problem of how to get the corresponding image regions with the visual grounding tools", but there doesn't seem to be a detailed description or corresponding code.

My question is: how exactly is your query constructed? Do the nouns come only from the nouns recognized in the text, or do you add the entity categories as well? Could you give a couple of examples illustrating how the corresponding image region is obtained from a text and an image?

Thank you very much ^^

njcx-ai commented 2 months ago

Hello, thank you very much for your interest in our work. The nouns used in our approach are extracted solely from the nouns recognized within the input text. Subsequently, we employ visual grounding techniques to identify and localize the corresponding image patches that semantically align with the extracted nouns.
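The pipeline described above can be sketched roughly as follows. This is a minimal illustration, not HVPNeT's released code: the POS tags are supplied by hand here, and `grounding_model` is a hypothetical stand-in for whatever visual grounding tool is used (e.g. the one adopted from UMGF).

```python
def extract_noun_queries(tokens, pos_tags):
    """Keep tokens whose POS tag marks a noun (NN, NNS, NNP, NNPS);
    each surviving token becomes one visual-grounding query."""
    return [tok for tok, tag in zip(tokens, pos_tags) if tag.startswith("NN")]

# Toy example; in practice the tags would come from a POS tagger such as spaCy.
tokens = ["Kevin", "Durant", "enters", "the", "locker", "room"]
pos_tags = ["NNP", "NNP", "VBZ", "DT", "NN", "NN"]
queries = extract_noun_queries(tokens, pos_tags)

# Each query is then grounded against the image to obtain a region
# (hypothetical call -- the actual tool and its API are not specified here):
# for q in queries:
#     region = grounding_model(image, q)  # bounding box / image patch for q
```

So for this sentence the queries would be "Kevin", "Durant", "locker", and "room", and the grounding tool would return one image patch per query.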

AdrewTomas commented 2 months ago

But your paper says you follow UMGF, which also uses entity types as queries. May I ask whether you use them as well? I would appreciate a couple of examples to make clear exactly how it is done.

njcx-ai commented 2 months ago

Thank you for your interest. I will verify the specific implementation in the code and get back to you with more details.

njcx-ai commented 1 month ago

Specifically, we first perform extraction based on nouns as queries. If the number of extracted nouns is relatively low, we incorporate some entity types to assist in the extraction process.
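That fallback logic can be sketched like this. Note the threshold and the entity-type list are assumptions for illustration; the thread does not state the exact cutoff, and the type names here are just the usual MNER categories.

```python
# Assumed threshold: below this many noun queries, pad with entity types.
# The actual value used in the paper/code is not stated in this thread.
MIN_QUERIES = 3

# Typical MNER entity categories, used here as illustrative padding queries.
ENTITY_TYPES = ["person", "location", "organization", "miscellaneous"]

def build_queries(noun_queries, min_queries=MIN_QUERIES):
    """Use nouns as grounding queries; if too few were extracted,
    append entity-type names to assist the extraction."""
    queries = list(noun_queries)
    if len(queries) < min_queries:
        queries += ENTITY_TYPES[: min_queries - len(queries)]
    return queries
```

For example, `build_queries(["dog"])` would yield `["dog", "person", "location"]`, while a sentence that already produced three or more nouns would be left unchanged.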

AdrewTomas commented 1 month ago

OK, thank you.