wusize / CLIPSelf

[ICLR2024 Spotlight] Code Release of CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction
https://arxiv.org/abs/2310.01403

Generating text embedding files #8

Closed yhosoya66 closed 4 months ago

yhosoya66 commented 4 months ago

Hi, thank you for your excellent work and the well-organized code you've shared. I really appreciate it!

I'd like to ask a few questions, if that's okay.

I'm interested in fine-tuning F-ViT from CLIPSelf (available in this repository) on a different dataset. For this purpose, I need to create embedding files like 'datasets/embeddings/coco_with_background_evaclip_vitb_16.pt', which are essentially text embeddings for the target dataset's categories, right?

Here are my questions:

  1. How can I create these embedding files for my own dataset? Could you provide some guidance or a script for generating text embedding files for an arbitrary dataset?
  2. Is the same text encoder used consistently across all configuration settings? If so, could you tell me which model you used?

Thanks for your help.

wusize commented 4 months ago

Hi, please use this script to generate text embeddings. Make sure to use the text encoder that corresponds to the ViT model.
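
For reference, the generation step looks roughly like the sketch below. This is a minimal, hedged example using open_clip rather than the repository's own script; the model/pretrained names, the prompt template, and the appended zero "background" row are assumptions that should be matched to your checkpoint and dataset.

```python
import torch
import open_clip

# Class names of the target dataset (hypothetical example list).
categories = ["person", "bicycle", "car"]

# "ViT-B-16"/"openai" is only a placeholder pairing that exists in open_clip;
# swap in the text encoder that matches the ViT you fine-tune (e.g. an EVA-CLIP variant).
model_name = "ViT-B-16"
model, _, _ = open_clip.create_model_and_transforms(model_name, pretrained="openai")
tokenizer = open_clip.get_tokenizer(model_name)
model.eval()

with torch.no_grad():
    # Single prompt template; the repository's script may use a richer prompt ensemble.
    tokens = tokenizer([f"a photo of a {c}" for c in categories])
    text_embeds = model.encode_text(tokens)
    text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)

# Append an all-zero row as a "background" embedding, mirroring the
# *_with_background_* naming of the released files (format assumption).
background = torch.zeros(1, text_embeds.shape[-1])
torch.save(torch.cat([text_embeds, background], dim=0),
           "my_dataset_with_background_vitb_16.pt")
```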

yhosoya66 commented 4 months ago

Thank you for your prompt reply. It works when I set the '--cache_dir' argument to the target pre-trained weights.