wjf5203 / VNext

Next-generation video instance recognition framework on top of Detectron2, which supports InstMove (CVPR 2023), SeqFormer (ECCV Oral), and IDOL (ECCV Oral)
Apache License 2.0

Do Table 4 and Table 5 use COCO pretraining or not? #39

Open superaha opened 2 years ago

superaha commented 2 years ago

Hi there,

Thank you for sharing the repo. In Table 3, the YouTube-VIS 2019 results are reported for models both with and without COCO pretraining.

What about Table 4 and Table 5 for IDOL? I could not find the detailed settings or an explanation for these two results.

Thanks

timmeinhardt commented 2 years ago

I have asked the same question in a different issue. This line

https://github.com/wjf5203/VNext/blob/d41c4a1b35e894df0ea90cce76f5e828d9a29da1/projects/IDOL/configs/ovis_r50.yaml#L3

seems to suggest that they used a model pretrained on COCO sequences, but I would appreciate clarification for the other tables as well!
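For context, Detectron2-style configs declare the initialization checkpoint under MODEL.WEIGHTS. A minimal way to check what a given config starts from is sketched below; the PyYAML-based approach and the assumption that the key sits directly in this file are illustrative, not the repo's own tooling, and the exact contents of the linked line are not reproduced here.

```python
# Illustrative only: read which checkpoint a Detectron2-style YAML config initializes from.
# Assumes the usual MODEL.WEIGHTS convention in the file itself (it may also live in a
# _BASE_ config); this is not the repo's own tooling.
import yaml

with open("projects/IDOL/configs/ovis_r50.yaml") as f:
    cfg = yaml.safe_load(f)

# A COCO-pretrained checkpoint here would mean the OVIS run is initialized from COCO weights.
print(cfg.get("MODEL", {}).get("WEIGHTS"))
```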

superaha commented 2 years ago

Thanks for pointing this out. Let us see if the authors can clarify. @wjf5203

wjf5203 commented 2 years ago

Hi,

Thanks for your interest and for pointing this out.

Let me clarify this. We have at most three training steps for IDOL:

Step 1: pre-train the instance segmentation pipeline on COCO, following all other VIS methods.

Step 2: pre-train IDOL on pseudo key-reference pairs from COCO. (This step forces the model to learn a position-insensitive contrastive embedding that relies on the appearance of the object rather than its spatial position.)

Step 3: fine-tune our VIS method IDOL on a VIS dataset (YTVIS19/YTVIS21/OVIS), following all other VIS methods.

So, the main difference is Step 2. In Tables 3, 4, and 5, all IDOL results marked with $\dagger$ are obtained with Steps 1+2+3; the results without $\dagger$ are obtained with Steps 1+3.
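To make Step 2 concrete, here is a rough sketch of how a pseudo key-reference pair could be built from a single static COCO image. The function names and augmentation parameters are illustrative assumptions, not the actual VNext implementation.

```python
# Hypothetical sketch of Step 2's data side: two differently jittered views of the same
# COCO image serve as the "key" and "reference" frames of a pseudo video, so associating
# them requires appearance cues rather than absolute spatial position.
# Function names and parameters are illustrative, not taken from the VNext code base.
import random
import torchvision.transforms.functional as TF

def spatial_jitter(img, max_shift=0.1, scale_range=(0.8, 1.2)):
    """Randomly shift and rescale a PIL image so object positions differ between views."""
    w, h = img.size
    dx = int(random.uniform(-max_shift, max_shift) * w)
    dy = int(random.uniform(-max_shift, max_shift) * h)
    scale = random.uniform(*scale_range)
    return TF.affine(img, angle=0.0, translate=[dx, dy], scale=scale, shear=[0.0])

def make_pseudo_pair(img):
    """Build a pseudo key-reference pair from one static image."""
    key_frame = spatial_jitter(img)
    ref_frame = spatial_jitter(img)
    return key_frame, ref_frame
```

In an actual pipeline the ground-truth boxes and masks would be transformed with the same parameters as their image, so the pair stays consistently annotated.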

We will add more detailed experimental settings in the next arXiv version ~

HanGuangXin commented 2 years ago

@wjf5203 Hi, so there are two pre-training steps. The first is on single frames from static COCO images, and the second is on pseudo key-reference pairs.

And I have 3 questions about this:

  1. Why do we have to pre-train on static COCO images first? Why isn't the second step alone enough?
  2. Are the provided pre-trained weights from the second step rather than the first? If so, could you provide the trained weights from the first step?
  3. How can I run the first pre-training step myself? It seems the code only supports the second and third steps.