About the selection of text and image conditional inputs

Hi Yuhang,

Thanks for your patient replies so far! I have one more small question, this time about the code that selects between the text and image conditional inputs.

In your code, clip_query = text_query * mask + img_query * (1 - mask) (line 308 of ovdetr/models/model.py), each query is taken from either the text or the image conditional input. The choice is random and is controlled by the mask generated at line 302 of the same file: mask = (torch.rand(len(text_query)) < self.prob).float().unsqueeze(1).to(text_query.device). Where the mask is 1 the text query is used, and where it is 0 the image query is used.

However, according to the paper, the text conditional inputs of novel classes cannot be used during training. So it seems the mask from line 302 of ovdetr/models/model.py needs one more processing step: setting the entries that correspond to novel classes to zero, so that those queries always come from the image input. Am I right?
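
To make the question concrete, here is a minimal sketch of the change I have in mind. It is not the repository's code: the function name select_conditional_queries and the is_novel tensor (True for rows that belong to a novel class) are my own assumptions for illustration, and prob stands in for self.prob.

```python
import torch


def select_conditional_queries(text_query, img_query, is_novel, prob, generator=None):
    """Pick the text or image query per row, never using text for novel classes.

    text_query, img_query: float tensors of shape (N, D).
    is_novel: bool tensor of shape (N,), True for novel-class rows
              (hypothetical input, not present in the original code).
    prob: probability of choosing the text query for a base-class row.
    """
    # Same random draw as the quoted line 302: 1.0 selects text, 0.0 selects image.
    rand = torch.rand(len(text_query), generator=generator)
    mask = (rand < prob).float().to(text_query.device)

    # Proposed extra step: zero the mask at novel-class rows so that
    # those rows always fall back to the image query.
    mask = mask * (~is_novel).to(mask.device).float()

    mask = mask.unsqueeze(1)  # shape (N, 1), broadcasts over the feature dim
    # Same mixing rule as the quoted line 308.
    return text_query * mask + img_query * (1 - mask)


# Toy usage: rows 1 and 3 are marked as novel.
text_query = torch.ones(4, 3)
img_query = torch.zeros(4, 3)
is_novel = torch.tensor([False, True, False, True])
clip_query = select_conditional_queries(text_query, img_query, is_novel, prob=1.0)
```

With prob=1.0 every base-class row takes the text query, while the two rows marked as novel still take the image query. If novel classes are already filtered out before this point in the training pipeline, the extra step would of course be unnecessary; I could not tell from the code around line 302.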

yuhangzang / OV-DETR
