Which sentence-bert model was used? And where is it implemented in the code?

zmykevin / UVLP

CVPR 2022 (Oral) Pytorch Code for Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment

Which sentence-bert model was used? And where is it implemented in the code? #3

Open TheShadow29 opened 2 years ago

TheShadow29 commented 2 years ago

Hello, thanks for the code. I cannot find in the paper which SBERT model was used. Could you also clarify how it is used in the training pipeline?

zmykevin commented 2 years ago

Here is the GitHub link for the Sentence-BERT implementation we use: https://github.com/UKPLab/sentence-transformers. The Sentence-BERT embeddings are only used to retrieve the weakly aligned sentences for each image. These pairs are generated before pre-training starts. The dataset we share for pre-training contains the weakly aligned pairs we created for each image. Every image is paired with 5 sentences.

TheShadow29 commented 2 years ago

@zmykevin Thanks for the pointer. I came across the repository, but couldn't figure out which model to use. Could you clarify how you generate the weakly aligned sentences? Again, thanks for the help.

zmykevin commented 2 years ago

The specific model we use is "paraphrase-MiniLM-L6-v2". How the sentences are retrieved for each image is described in Section 3.2 of the paper. Please let me know if you have any specific questions about that section.
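
For readers following along, here is a minimal sketch of how the model named above could be loaded with the sentence-transformers package to embed object lists and candidate sentences. It is not the authors' code: the helper names `object_list_to_text` and `embed_texts` are hypothetical, and joining object tags with spaces and L2-normalizing the embeddings are both assumptions made for illustration.

```python
from typing import List, Sequence

# Model name as stated in this thread.
MODEL_NAME = "paraphrase-MiniLM-L6-v2"


def object_list_to_text(object_tags: Sequence[str]) -> str:
    """Join detected object tags into a single query string.

    Assumption: plain space-joining. The exact formatting used by the
    authors is not given in this thread.
    """
    return " ".join(tag.strip() for tag in object_tags if tag.strip())


def embed_texts(texts: List[str], model_name: str = MODEL_NAME):
    """Return one embedding per input text as a NumPy array.

    The import is done lazily so that the helper above can be used
    without sentence-transformers installed. The first call downloads
    the model weights.
    """
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    # normalize_embeddings=True makes inner product equal cosine similarity
    # (an assumption here, not something the thread specifies).
    return model.encode(texts, convert_to_numpy=True, normalize_embeddings=True)
```

With these helpers, `embed_texts([object_list_to_text(["dog", "frisbee", "grass"])])` would give the query embedding for one image, and `embed_texts(sentences)` the embeddings for the text corpus.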

TheShadow29 commented 2 years ago

@zmykevin Thanks for the reply. How do you perform the retrieval for large image and text sets? Do you have a particular implementation?

zmykevin commented 2 years ago

We use FAISS (https://github.com/facebookresearch/faiss) to compute the similarity between the object list embeddings and the natural sentence embeddings.
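
As an illustration only, the sketch below shows one way to retrieve the top 5 sentences per image from precomputed embeddings. The thread does not say which FAISS index type was used, so the exact-search `faiss.IndexFlatIP` here is an assumption. The function names are hypothetical, and the NumPy version is included just to show what an exhaustive inner-product search computes.

```python
import numpy as np


def l2_normalize(x) -> np.ndarray:
    """Scale each row to unit length so inner product equals cosine similarity."""
    x = np.asarray(x, dtype=np.float32)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, 1e-12)


def retrieve_top_k_numpy(query_emb: np.ndarray, sentence_emb: np.ndarray, k: int = 5):
    """Exhaustive inner-product search in plain NumPy.

    Returns (scores, indices), each of shape (num_queries, k), sorted by
    decreasing score.
    """
    scores = query_emb @ sentence_emb.T
    idx = np.argsort(-scores, axis=1)[:, :k]
    return np.take_along_axis(scores, idx, axis=1), idx


def retrieve_top_k_faiss(query_emb: np.ndarray, sentence_emb: np.ndarray, k: int = 5):
    """The same search using FAISS.

    Assumption: an exact flat inner-product index. The authors may have
    used a different index type.
    """
    import faiss  # imported lazily; requires faiss-cpu or faiss-gpu

    index = faiss.IndexFlatIP(sentence_emb.shape[1])
    index.add(np.ascontiguousarray(sentence_emb, dtype=np.float32))
    return index.search(np.ascontiguousarray(query_emb, dtype=np.float32), k)
```

Here `query_emb` would hold one object list embedding per image and `sentence_emb` one embedding per candidate sentence, so `k=5` matches the 5 sentences per image mentioned earlier in the thread. For corpora too large for exhaustive search, FAISS also offers approximate indexes, at some cost in recall.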

TheShadow29 commented 2 years ago

@zmykevin Thanks for the reply. Do you happen to have it implemented somewhere? I couldn't find it in the repo. Did you use the plain IndexFlatIP (i.e. exact inner-product search) or one of the other optimized indexes such as IndexIVFFlat?