Open 1049451037 opened 1 year ago
Since the output of ViT also contains position information, if we directly feed the embeddings of a visual concept region into an MLP to predict its bounding box, will the model just learn to output a trivial position transformation?
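
To make the setup in question concrete, here is a rough sketch of the kind of head I mean. It is only an illustration under my own assumptions, not code from this repo: `bbox_head`, the mean-pooling step, the layer sizes, and the random stand-in for the ViT tokens are all hypothetical.

```python
import numpy as np

def bbox_head(region_tokens, w1, b1, w2, b2):
    """Hypothetical MLP head: pool the region's tokens, predict (cx, cy, w, h) in (0, 1)."""
    pooled = region_tokens.mean(axis=0)         # (dim,) average over the region's patch tokens
    hidden = np.maximum(pooled @ w1 + b1, 0.0)  # ReLU hidden layer
    logits = hidden @ w2 + b2                   # (4,) raw box parameters
    return 1.0 / (1.0 + np.exp(-logits))        # sigmoid keeps coordinates normalized

rng = np.random.default_rng(0)
dim, hidden_dim = 16, 32  # assumed sizes, chosen only for the example

# Random stand-in for the ViT output tokens of one region (4 patches).
# Real tokens would already carry position information from the encoder.
region_tokens = rng.normal(size=(4, dim))

w1 = rng.normal(scale=0.1, size=(dim, hidden_dim))
b1 = np.zeros(hidden_dim)
w2 = rng.normal(scale=0.1, size=(hidden_dim, 4))
b2 = np.zeros(4)

box = bbox_head(region_tokens, w1, b1, w2, b2)
print(box)  # four values in (0, 1): cx, cy, w, h
```

My concern is that, with position already encoded in `region_tokens`, a head like this could reach low loss by decoding the position signal alone rather than using the visual content.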