yuhangzang / OV-DETR

[Under preparation] Code repo for "Open-Vocabulary DETR with Conditional Matching" (ECCV 2022)

Several Questions on training and inference #1

Closed lxtGH closed 1 year ago

lxtGH commented 2 years ago

Hello! This is a very interesting work for building open-vocabulary learning with DETR.

We read the paper but have several questions:

1. What does "R" mean? How is it involved in training and testing? In Fig. 4, it seems that R is the number of classes?

2. What is the ground truth during the matching (p in Eq. 6)? What is its relation to R? We are quite confused by Fig. 3(b).

3. How are novel classes handled during training? How are the novel proposals identified? How can they be used for training since no ground truths are included?

Thanks!

d12306 commented 2 years ago

Same confusion here, especially about how to get the ground truth for the binary matching. Here are some of my thoughts. For text inputs, you transform the multi-class labels by filtering out the labels that do not belong to the current text, and the class id of the labels that do align with the text is set to 1. For image inputs, the decoder box predictions aligned with the proposal input are set to the matched class 1. But wouldn't that be very time-consuming during training, especially since you have multiple copies of the queries plus the conditional image inputs, and Mask R-CNN, CLIP, and DETR are all involved?
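
A rough sketch of the label filtering I have in mind, in plain PyTorch (the function name and tensor shapes are made up for illustration, not taken from this repo):

```python
import torch

def binary_targets_for_condition(gt_labels, gt_boxes, cond_class_id):
    # Keep only the GT boxes whose label equals the class of the current
    # conditional (text or image) input, and relabel them as class 1.
    keep = gt_labels == cond_class_id            # (num_gt,) bool mask
    boxes = gt_boxes[keep]                       # (M, 4) boxes for this condition
    labels = torch.ones(int(keep.sum()), dtype=torch.long)  # all positives -> 1
    return labels, boxes
```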

yuhangzang commented 2 years ago

Hi @lxtGH , thanks for your interest in our work. For your questions:

  1. The symbol 'R' denotes the number of times we repeat the object queries for the conditional inputs. For example, DETR has N=100 object queries for the Transformer decoder. For our OV-DETR, we first repeat the queries R times to get NxR queries (e.g., 100x3=300). Then we add the conditional inputs to these NxR queries according to Eq. (4). For example, queries 0-99 are added with the CLIP image embedding of 'cat', queries 100-199 are added with the CLIP text embedding of 'dog', etc. (see the sketch after this list). The symbol R is independent of the number of categories.

  2. The ground truth depends on the conditional inputs. For example, if we add the CLIP text embedding of 'dog' as the conditional input, the ground-truth bounding boxes of class 'dog' within the batch are treated as the ground truth.

  3. Like ViLD, we pre-train the model on base classes to extract object proposals and take their CLIP image embeddings as the conditional inputs. The detector is then required to predict which regions in the image match the conditional image features from the object proposals (like an image retrieval task).
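
A minimal sketch of the repeat-and-condition step from point 1, in plain PyTorch (shapes and variable names are made up for illustration and do not come from this repo):

```python
import torch
import torch.nn as nn

N, R, d = 100, 3, 256                     # object queries, repeat times, decoder dim

query_embed = nn.Embedding(N, d)          # the usual DETR object queries
cond = torch.randn(R, 512)                # R conditional CLIP embeddings
                                          # (image or text), e.g. 'cat', 'dog', ...
proj = nn.Linear(512, d)                  # project CLIP dim to decoder dim

queries = query_embed.weight.repeat(R, 1)             # (N*R, d), e.g. 300 x 256
cond_per_query = proj(cond).repeat_interleave(N, 0)   # queries 0-99 get cond[0],
                                                      # 100-199 get cond[1], ...
conditioned = queries + cond_per_query                # Eq. (4): add conditional input
```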

Dwrety commented 2 years ago

> (quoting @yuhangzang's reply above)

@yuhangzang Thank you for your reply! I am still a bit confused about the process. This is my understanding of the matching process, correct me if I am wrong.

  1. Let's assume there are 3 bounding box annotations in the input image {cat, dog, apple}. For each box, you first generate its CLIP embedding, so there are 3 embedding vectors z_cat, z_dog, and z_apple.

  2. For the novel object proposals, you replace the classification layer with the CLIP text embeddings and train on base classes. You then generate some number, let's say m=2, of embedding vectors z_novel. So there are 5 conditional input embeddings in total? How do you determine whether a proposal is novel, and what does row 2 in Table 2 mean? Also, how do you handle the background class during the base training stage?

  3. For the matching, by default there are N=100 queries. You copy them R=3 times. Then how do you allocate the NxR queries among the 5 conditional input embeddings? Do you simply allocate them evenly, i.e., in this case each z is added to 300/5=60 queries?

  4. The GT label for each class-conditional query is simply {0, 1}? For example, among the queries conditioned on cat, only one gets matched to the cat while the rest do not? And if there are M cats in the image, then M queries get matched to cat?

mactavish91 commented 1 year ago

> (quoting @Dwrety's comment above)

Hello, has your question been solved? I've read it many times, but I still don't understand it.

mactavish91 commented 1 year ago

> (quoting the original questions above)

Hello, has your question been solved? I've read it many times, but I still don't understand it.

mactavish91 commented 1 year ago

> (quoting @d12306's comment above)

Hello, has your question been solved? I've read it many times, but I still don't understand it.

liangchen976 commented 2 months ago

> (quoting @Dwrety's four questions above)

  1. Yes.
  2. Isolated training.
  3. It's up to you. The number of copies only affects training efficiency. You can forward your model with only one class name at a time, such as 'cat'.
  4. No, the cost matrix also contains bbox-related losses. To put it more strongly, I think the matching cost here is essentially a bbox regression cost, because it does not require multi-class classification like closed-set DETR. The classification is handled by the query conditioning, and the decoder only needs to generate a bbox for each query (see the sketch below).
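
To illustrate point 4, here is a rough sketch of a per-condition matching cost in the DETR style. This is my own reading with made-up names, not the actual OV-DETR implementation; the classification part reduces to a binary "matched" probability, and the rest is the usual box regression cost:

```python
import torch
from scipy.optimize import linear_sum_assignment

def match_for_condition(pred_match_prob, pred_boxes, gt_boxes, w_cls=2.0, w_l1=5.0):
    # Hungarian matching for a single conditional input.
    # pred_match_prob: (N,) probability that each query matches this condition.
    # pred_boxes: (N, 4), gt_boxes: (M, 4), both in normalized cxcywh format.
    cost_cls = -pred_match_prob[:, None]                # binary "matched" cost, (N, 1)
    cost_box = torch.cdist(pred_boxes, gt_boxes, p=1)   # L1 box cost, (N, M)
    cost = w_cls * cost_cls + w_l1 * cost_box           # GIoU term omitted for brevity
    rows, cols = linear_sum_assignment(cost.detach().cpu().numpy())
    return rows, cols                                   # matched (query, GT) index pairs
```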