Describe our benchmark

The rough idea is to establish a standard protocol:

Researchers are allowed to use any type of textual prediction model. However, the training set of the detector is limited to MIRFlickr1M. They may use any method to mine a subset of paired user-generated content (UGC) tags and images from MIRFlickr1M (e.g., this is how we form our Flickr200K training set). Finally, all studies report mAP@.5 on the VOC07 test set.
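As a rough illustration of what "mining a subset of paired tags and images" could look like, here is a minimal sketch that keeps only images whose tags exactly match a VOC class name. This is a simple exact-match baseline for illustration, not necessarily how Flickr200K was built; the function name `mine_pseudo_labels` and the `(image_id, tags)` record layout are assumptions.

```python
# The 20 PASCAL VOC object classes, used here as the label vocabulary.
VOC_CLASSES = (
    "aeroplane", "bicycle", "bird", "boat", "bottle", "bus", "car", "cat",
    "chair", "cow", "diningtable", "dog", "horse", "motorbike", "person",
    "pottedplant", "sheep", "sofa", "train", "tvmonitor",
)


def mine_pseudo_labels(records, vocabulary=VOC_CLASSES):
    """Keep images whose tags contain at least one vocabulary word.

    records: iterable of (image_id, tags) pairs (hypothetical layout).
    Returns a dict mapping image_id to a sorted list of matched labels.
    """
    vocab = set(vocabulary)
    mined = {}
    for image_id, tags in records:
        # Normalize case and whitespace, then intersect with the vocabulary.
        labels = sorted({tag.strip().lower() for tag in tags} & vocab)
        if labels:
            mined[image_id] = labels
    return mined


# Toy records standing in for MIRFlickr1M tag metadata.
records = [
    ("im1", ["Dog", "park", "sunset"]),
    ("im2", ["holiday", "beach"]),
    ("im3", ["cat", "sofa", "CAT"]),
]
print(mine_pseudo_labels(records))
```

Any smarter textual model (synonym matching, learned classifiers over tags, and so on) would replace the exact-match step while leaving the rest of the protocol unchanged.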

A better model is one that makes better use of the 1M data.
Researchers can propose novel ways to mine the textual labels, but they may not crowdsource ground-truth annotations.
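For readers unfamiliar with the mAP@.5 metric named in the protocol, the sketch below shows the two pieces involved: the IoU test that decides whether a detection counts as correct at the 0.5 threshold, and the 11-point interpolated average precision conventionally used with VOC07. It is a simplified illustration, not the official evaluation code: boxes are assumed to be `(xmin, ymin, xmax, ymax)` in continuous coordinates, and the true-positive flags are assumed to have been computed already by matching each detection to an unused ground-truth box with IoU >= 0.5.

```python
def iou(box_a, box_b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def average_precision_11pt(scored_hits, num_gt):
    """11-point interpolated AP for one class.

    scored_hits: list of (confidence, is_true_positive) per detection.
    num_gt: number of ground-truth boxes of this class.
    """
    ranked = sorted(scored_hits, key=lambda hit: -hit[0])
    precisions, recalls = [], []
    true_positives = 0
    for rank, (_, is_tp) in enumerate(ranked, start=1):
        true_positives += int(is_tp)
        precisions.append(true_positives / rank)
        recalls.append(true_positives / num_gt)
    ap = 0.0
    for threshold in (i / 10 for i in range(11)):
        # Best precision achieved at any recall >= threshold (0 if none).
        best = max(
            (p for p, r in zip(precisions, recalls) if r >= threshold),
            default=0.0,
        )
        ap += best / 11
    return ap


# A detection overlapping half of a ground-truth box fails the 0.5 test.
print(iou((0, 0, 10, 10), (5, 0, 15, 10)) >= 0.5)
print(average_precision_11pt([(0.9, False), (0.8, True)], num_gt=1))
```

mAP is then the mean of the per-class AP values over the 20 VOC classes.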

yekeren / Cap2Det

Describe our benchmark #9