superaha opened this issue 2 years ago
I have asked the same question in a different issue. This line
seems to suggest that they used a model pretrained on COCO sequences, but I would appreciate clarification on the other results as well!
Thanks for pointing this out. Let us see if the authors can clarify. @wjf5203
Hi,
Thanks for your attention and for pointing this out.
Let me clarify this. We have at most three training steps for IDOL:
Step 1: pre-train the instance segmentation pipeline on COCO, following all other VIS methods.
Step 2: pre-train IDOL on pseudo key-reference pairs generated from COCO. This step forces the model to learn a position-insensitive contrastive embedding that relies on the appearance of the object rather than its spatial position.
Step 3: fine-tune our VIS method IDOL on a VIS dataset (YTVIS19/YTVIS21/OVIS), following all other VIS methods.
So the main difference is Step 2. In Tables 3, 4, and 5, all IDOL results marked with $\dagger$ are obtained with Steps 1+2+3; those without $\dagger$ are obtained with Steps 1+3.
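To illustrate Step 2, here is a minimal sketch of how a pseudo key-reference pair could be generated from a single static COCO image by applying two independent random augmentations, so that matched instances share appearance but not spatial position. The specific transforms and parameters below are illustrative placeholders, not the exact settings used in IDOL:

```python
# Sketch: build a pseudo key-reference pair from one static COCO image.
# The "key" and "reference" frames are two independently augmented views of
# the same image, so matched instances keep their appearance but end up at
# different spatial positions. Transform choices here are illustrative only.
from PIL import Image
import torchvision.transforms as T

augment = T.Compose([
    T.RandomResizedCrop(480, scale=(0.6, 1.0)),   # random crop/rescale shifts object positions
    T.RandomHorizontalFlip(p=0.5),
    T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    T.ToTensor(),
])

def make_pseudo_pair(image_path):
    """Return (key_frame, reference_frame) tensors from a single COCO image."""
    img = Image.open(image_path).convert("RGB")
    key_frame = augment(img)        # first augmented view
    reference_frame = augment(img)  # second, independently augmented view
    return key_frame, reference_frame
```

In the actual pipeline the instance annotations (boxes/masks) would have to be transformed consistently with each view so that identities can be matched for the contrastive loss; that bookkeeping is omitted from this sketch.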
We will add more detailed experimental settings in the next arXiv version ~
@wjf5203 Hi, so there are two pre-training steps. The first is on single frames from static COCO images, and the second is on pseudo key-reference pairs.
And I have 3 questions about this:
Hi there,
Thank you for sharing the repo. In Table 3, the results on YouTube-VIS 2019 are reported for models both with and without COCO pretraining.
What about Tables 4 and 5 for IDOL? I could not find the detailed settings and explanations for those results.
Thanks