zhegan27 / VILLA

Research Code for NeurIPS 2020 Spotlight paper "Large-Scale Adversarial Training for Vision-and-Language Representation Learning": UNITER adversarial training part
https://arxiv.org/pdf/2006.06195.pdf
MIT License

How to extract features to do image retrieval #5

Open eugeneware opened 3 years ago

eugeneware commented 3 years ago

Thank you for this amazing piece of work.

I'm interested in using VILLA or UNITER to do image retrieval.

I'd like to pre-extract features from VILLA for a folder of images and then retrieve them at inference time by using a text query.

I note that in your paper you publish image retrieval and text retrieval metrics.

I've run the code as noted in the UNITER repo:

# text annotation preprocessing
bash scripts/create_txtdb.sh $PATH_TO_STORAGE/txt_db $PATH_TO_STORAGE/ann

# image feature extraction (Tested on Titan-Xp; may not run on latest GPUs)
bash scripts/extract_imgfeat.sh $PATH_TO_IMG_FOLDER $PATH_TO_IMG_NPY

# image preprocessing
bash scripts/create_imgdb.sh $PATH_TO_IMG_NPY $PATH_TO_STORAGE/img_db

Most of the scripts and examples I can see in the repo require both images and text to be presented to the model.

Do you have any examples or advice on how to get text-only representations/features that could be used to then retrieve images by their pre-encoded features?

Thanks for any help or guidance you can provide.

zhegan27 commented 3 years ago

@eugeneware , thanks for your inquiry. For UNITER & VILLA, both image & text need to be fed into the model, so text-only features cannot be obtained. This design choice gives better performance, since multimodal fusion happens at an early stage. However, inference can be very slow, because each text query has to be fused with every candidate image to get a similarity score.
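(Not code from this repo, just a rough sketch of why that is slow: each query has to run through the full cross-modal model once per candidate image. `score_pair` below is a hypothetical stand-in for a UNITER/VILLA forward pass that returns an image-text matching score.)

import numpy as np

def score_pair(text_tokens, img_feat):
    """Hypothetical stand-in for one full UNITER/VILLA forward pass
    returning an image-text matching score for a single pair."""
    raise NotImplementedError

def cross_encoder_retrieval(text_tokens, all_img_feats, top_k=10):
    # Every candidate image must be fused with the query text,
    # so the cost grows linearly with the size of the image corpus.
    scores = np.array([score_pair(text_tokens, f) for f in all_img_feats])
    ranked = np.argsort(-scores)  # best match first
    return ranked[:top_k], scores[ranked[:top_k]]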

From my understanding, what you want is to extract text and image features separately and then use a dot product for image retrieval. My suggestion is to first try BERT for text feature extraction, and then train an image retrieval model on top of it. Actually, my colleagues at Microsoft recently submitted a paper to NAACL 2021 in which they pre-train in this two-stream way so that image retrieval can be very fast. The paper is still under review, though.
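(Again just my own sketch of that two-tower idea, not code from this repo or from that paper: embed the text query with BERT, pre-compute one embedding per image offline, and rank by dot product. The file `image_embs.npy`, the mean-pooling choice, and the assumption that the image embeddings already live in the same space as the text embeddings are all hypothetical; in practice you would train a projection head or the image tower so the two spaces line up.)

import numpy as np
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
text_encoder = BertModel.from_pretrained("bert-base-uncased")

# Pre-computed image embeddings, one row per image, saved offline
# (hypothetical file produced by whatever image encoder you train).
image_embs = np.load("image_embs.npy")                      # (num_images, dim)
image_embs /= np.linalg.norm(image_embs, axis=1, keepdims=True)

def embed_text(query):
    inputs = tokenizer(query, return_tensors="pt")
    with torch.no_grad():
        out = text_encoder(**inputs)
    # Mean-pool the token embeddings as a simple sentence representation.
    emb = out.last_hidden_state.mean(dim=1).squeeze(0).numpy()
    return emb / np.linalg.norm(emb)

def retrieve(query, top_k=10):
    scores = image_embs @ embed_text(query)                 # cosine similarity
    return np.argsort(-scores)[:top_k]

Because the image embeddings are pre-computed, the per-query cost is one text forward pass plus a matrix-vector product, which is what makes this setup fast compared with the cross-encoder loop above.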

Hope it helps. Thanks.

Best, Zhe

eugeneware commented 3 years ago

Thanks so much for your reply @zhegan27. So, to clarify: the Image Retrieval metrics in the paper were produced by taking each text query, running it against every single image in the corpus to get a similarity/ranking score, and then ordering the results by best match? If that's the case, it wouldn't work in a low-latency inference environment.

But when I look at the UniterModel base class, I can see code that allows you to pass in only text tokens, only image features, or both. Is it unlikely that the text-only representation and the image-only representation would be similar in the shared embedding space?

Are you saying that feeding in just image features and pre-computing the embedding output, and then trying to retrieve those image embeddings by cosine distance/dot product against an embedding computed from just the text tokens, is unlikely to work?

Thanks again for your help.

zhegan27 commented 3 years ago

@eugeneware , sorry that I am busy with paper deadlines this week, will get back to you this weekend or early next week. Thanks for your understanding.

eugeneware commented 3 years ago

@zhegan27 I completely understand. Good luck with your paper deadline. I really appreciate you being so generous with your time.