zmykevin / UVLP

CVPR 2022 (Oral) PyTorch Code for Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment

Encoding for the file in data_preparation #2

Closed · mengqiDyangge closed this issue 2 years ago

mengqiDyangge commented 2 years ago

hi~

May I ask what encoding is used for the files in data_preparation, such as "._Caption_Retrieve.py", and how should I view them?

zmykevin commented 2 years ago

Hi, those files are not meant to be shared at the moment, as I haven't cleaned them up yet. They mainly show how we retrieve sentences using the object tags detected in the image. The object detector is from VinVL, and the sentence embeddings are from Sentence-BERT. Let me know if you have specific questions.

mengqiDyangge commented 2 years ago

Thanks for your nice reply~

I have two questions about the retrieval process:

  1. Are the tags joined by spaces (" ") to form a tag sequence like "dog cat bus ..."?

  2. How should I determine the order in which to connect the tags? For example, should it be "cat dog bus", "dog cat bus", or "bus cat dog"?

zmykevin commented 2 years ago

Hi, for the retrieval stage, you insert one object tag into Sentence-BERT at a time to get an embedding for each object tag. Then, to represent the object list, you just use the mean of all the embeddings. In this case, the order does not matter.

During pre-training, each object tag's position embedding is tied to its relative position in the image, so the order in which you arrange the object tags in the list also does not matter.
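
A minimal sketch of this per-tag encoding and mean pooling, assuming the sentence-transformers library; the model name is an illustrative placeholder, not necessarily the checkpoint used in the paper:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Illustrative model choice; the paper's exact Sentence-BERT checkpoint may differ.
model = SentenceTransformer("all-MiniLM-L6-v2")

def embed_tag_list(tags):
    """Encode each object tag separately, then mean-pool the embeddings."""
    tag_embs = model.encode(tags)   # shape: (num_tags, dim), one row per tag
    return tag_embs.mean(axis=0)    # shape: (dim,)

# Mean pooling is permutation-invariant, so tag order does not matter:
a = embed_tag_list(["dog", "cat", "bus"])
b = embed_tag_list(["bus", "dog", "cat"])
assert np.allclose(a, b)
```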

mengqiDyangge commented 2 years ago

Hi~, I have tried average pooling of all the tag embeddings for retrieval. However, I find that this strategy leads to a strong bias toward retrieving short sentences from the corpus. I wonder if this also happened in your experiments? Do you have any suggestions for me?

zmykevin commented 2 years ago

Hi Mengqi, we did not analyze the length distribution of our retrieved sentences, so it is likely that we also suffer from that bias. A few things I can think of that might help (a rough sketch of the first two follows below):

  1. We did not use all of the detected object tags from VinVL: we filter out object tags with a low detection confidence score, as well as those with a relatively small bounding box (indicating that the object may not be visually important).

  2. You can try directly inserting the object tag list as a single string into the Sentence-BERT model. We tried this, and the retrieved sentences looked quite good to me. However, this raises the question of how to order the object list, which we did not study in our research.

  3. Something we did not try is fine-tuning a sentence model that learns to map a list of objects to a natural sentence. I think this might be the best solution, as it would learn an appropriate embedding for the object list as a special data structure, instead of relying on a pre-trained model that only works for natural sentences.

Hope these suggestions help.
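
A hedged sketch of suggestions (1) and (2), assuming the sentence-transformers library; the detection fields, thresholds, and model name are illustrative assumptions, not the paper's actual settings:

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative model choice; the paper's exact checkpoint may differ.
model = SentenceTransformer("all-MiniLM-v2".replace("-v2", "-L6-v2"))

def filter_tags(detections, min_conf=0.5, min_area_frac=0.02):
    """Suggestion (1): drop low-confidence and visually small detections.

    `detections` is assumed to be a list of dicts like
    {"tag": str, "conf": float, "area_frac": float} built from VinVL output;
    the field names and thresholds here are hypothetical.
    """
    return [d["tag"] for d in detections
            if d["conf"] >= min_conf and d["area_frac"] >= min_area_frac]

def retrieve(detections, corpus_sentences, top_k=5):
    """Suggestion (2): encode the joined tag string, rank by cosine similarity."""
    query = " ".join(filter_tags(detections))            # e.g. "dog cat bus"
    query_emb = model.encode(query, convert_to_tensor=True)
    corpus_emb = model.encode(corpus_sentences, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, corpus_emb)[0]      # shape: (len(corpus),)
    top = scores.topk(min(top_k, len(corpus_sentences)))
    return [(corpus_sentences[int(i)], s.item())
            for i, s in zip(top.indices, top.values)]
```

As the reply notes, how the tags are ordered inside the joined query string is an open choice here; mean pooling sidesteps it, while the string approach does not.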

mengqiDyangge commented 2 years ago

Thanks for your valuable suggestions; I will think about this problem further.