Could you share the detection performance of your model (mAP@0.5, mAP@0.25, AR@0.5, and AR@0.25) without the grounding and captioning heads? Alternatively, could you share the command for training the detection branch only?
Sorry, I've been busy recently and haven't tested this. I did try the results of 3DJCG on the test set for dense captioning here, which may serve as a reference.