Thanks for the great work!
BARON uses ResNet50-FPN as the backbone when CLIP is the supervision, but ResNet50-C4 when captions are the supervision. Why are different backbones used for the two kinds of supervision? Why not use ResNet50-FPN with caption supervision as well?