zhufengx / SRN_multilabel


computing mAP #5

Closed zizhaozhang closed 6 years ago

zizhaozhang commented 6 years ago

Hi Feng

I am wondering if you could help me with the mAP evaluation on COCO. I found that your code uses a VOC-style average precision for each class, but I am not sure about the intuition behind this computation method. I follow the standard mAP method used here, here, and here. I filter the labels the same way you do, keeping the top-3 or all labels with prob > 0.5, and when I run your provided predicted label.txt on COCO I get the exact values, so my way of filtering the labels is correct.
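For concreteness, this is a minimal sketch of the filtering I mean (the array shapes and the helper name are mine, not taken from your repo):

```python
import numpy as np

def filter_labels(probs, top_k=3, threshold=0.5):
    """Binarize predictions: keep labels with probability > threshold and,
    if top_k is not None, additionally keep only the top_k highest-scoring
    labels per image ("top-all" corresponds to top_k=None).
    probs has shape (num_images, num_classes)."""
    preds = probs > threshold
    if top_k is not None:
        ranked = np.argsort(probs, axis=1)[:, ::-1]            # per-image ranking, high to low
        mask = np.zeros(probs.shape, dtype=bool)
        np.put_along_axis(mask, ranked[:, :top_k], True, axis=1)
        preds &= mask
    return preds
```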

Could you provide more information about how you compute mAP, and is it the same as in the CNN-RNN paper?

Appreciated!

zhufengx commented 6 years ago

Hi, @zizhaozhang

I can provide the following information about the evaluation metrics, if I haven't misunderstood your question.

(1) By using VOC-style mAP, we formulate the multi-label prediction problem as multiple image retrieval tasks, with each task retrieving the images tagged with one specific label. AP is calculated for each retrieval task and then averaged over all label classes (see the sketch after this list). The per-class APs also help identify which labels are difficult to recognize.

(2) We didn't filter predictions using "top-3" constraints or thresholds before calculating mAP. The raw predicted label probabilities are used in the mAP metric.

(3) Our mAP is different from the "MAP@10" reported in CNN-RNN, so the values cannot be directly compared. In my understanding, CNN-RNN calculates "AP@10" for each image and then averages the values over all images; this is the key difference between CNN-RNN's "MAP@10" and our mAP.
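To make the contrast between (1) and (3) concrete, here is a minimal sketch. It uses sklearn's average_precision_score as a stand-in for per-class AP and a simplified AP@k; it is not the exact code from this repo or from CNN-RNN:

```python
import numpy as np
from sklearn.metrics import average_precision_score

def map_over_classes(y_true, y_score):
    """Our convention: one AP per label class (image-retrieval view),
    then the mean over classes. Both inputs: (num_images, num_classes)."""
    aps = [average_precision_score(y_true[:, c], y_score[:, c])
           for c in range(y_true.shape[1])]
    return float(np.mean(aps))

def map_at_k_over_images(y_true, y_score, k=10):
    """CNN-RNN-style MAP@k as I understand it: AP@k per image, then the
    mean over images (the exact normalization in the paper may differ)."""
    per_image = []
    for t, s in zip(y_true, y_score):
        order = np.argsort(s)[::-1][:k]          # top-k predicted labels
        hits, precisions = 0, []
        for rank, c in enumerate(order, start=1):
            if t[c]:
                hits += 1
                precisions.append(hits / rank)   # precision at each hit
        per_image.append(np.mean(precisions) if precisions else 0.0)
    return float(np.mean(per_image))
```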

zizhaozhang commented 6 years ago

@zhufengx Thanks for replying. I agree that AP should be computed per class and that mAP is obtained by averaging the APs over classes. I think the difference lies in how AP itself is computed. In my view, the correct way to compute AP follows this intuition, and the other two references I cited above use it as well. But from your AP_VOC code, the AP computation looks different.

But anyway, thanks for your explanation. I will probably stick with F1, since I found that later papers also do not use this metric, such as this.
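For reference, the F1 numbers I have in mind are the overall and per-class F1 variants commonly reported on these benchmarks; a minimal sketch on the binarized predictions (my own helper, not code from any of the papers):

```python
import numpy as np

def multilabel_f1(y_true, y_pred):
    """Overall-F1 (micro: TP/FP/FN pooled over all image-label pairs) and
    per-Class-F1 (macro: F1 per class, then averaged over classes).
    Both inputs are binary arrays of shape (num_images, num_classes)."""
    tp = np.logical_and(y_true, y_pred).sum()
    of1 = 2.0 * tp / (y_true.sum() + y_pred.sum())

    tp_c = np.logical_and(y_true, y_pred).sum(axis=0)
    prec_c = tp_c / np.maximum(y_pred.sum(axis=0), 1)
    rec_c = tp_c / np.maximum(y_true.sum(axis=0), 1)
    cf1 = np.mean(2 * prec_c * rec_c / np.maximum(prec_c + rec_c, 1e-12))
    return float(of1), float(cf1)
```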

zhufengx commented 6 years ago

Hi @zizhaozhang, thanks for pointing out this problem. I agree that there are several different ways of computing AP. The links you referred to follow the original definition of AP, while VOC-2012-style AP computes the area under a refined (monotonically non-increasing) recall-precision curve. Both have been widely used.
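For anyone reading later, a minimal sketch of the two conventions being discussed, for a single label class (a simplified version of the VOC-2012-style interpolation; not the exact AP_VOC code in this repo):

```python
import numpy as np

def ap_original(y_true, y_score):
    """Original AP definition: average the precision values at the ranks
    where a positive example is retrieved (assumes >= 1 positive)."""
    order = np.argsort(y_score)[::-1]
    rel = y_true[order].astype(float)
    precision_at_rank = np.cumsum(rel) / np.arange(1, len(rel) + 1)
    return float(precision_at_rank[rel == 1].mean())

def ap_voc12(y_true, y_score):
    """VOC-2012-style AP: area under the recall-precision curve after the
    precision has been made monotonically non-increasing ("refined")."""
    order = np.argsort(y_score)[::-1]
    rel = y_true[order].astype(float)
    cum_tp = np.cumsum(rel)
    recall = cum_tp / rel.sum()
    precision = cum_tp / np.arange(1, len(rel) + 1)

    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([0.0], precision, [0.0]))
    for i in range(len(mpre) - 2, -1, -1):          # enforce monotonicity
        mpre[i] = max(mpre[i], mpre[i + 1])
    idx = np.where(mrec[1:] != mrec[:-1])[0]        # recall change points
    return float(np.sum((mrec[idx + 1] - mrec[idx]) * mpre[idx + 1]))
```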

Since we have provided reference predictions on all three datasets, you can evaluate them with whatever metric you prefer for a fair comparison with our method.