Closed by buxiangzhiren 6 months ago
We did not run experiments with Video-Swin-Base without pre-training on RefCOCO, so I'm sorry that I cannot answer your question. The small size of Ref-Youtube-VOS may be the main factor.
Thanks!
Thank you for sharing such excellent work. Have you tested the Video Swin Transformer Base as a backbone on the Ref-Youtube-VOS dataset without pre-training on RefCOCO? The results I obtained with your code are similar to those with Video Swin Tiny.
I'm unsure of the cause. It's possible there are some bugs, or the Ref-Youtube-VOS dataset might be too small for effectively fine-tuning the Video Swin Transformer Base.
Thank you for your attention!