The results are too low!!!

Zhang-HM commented 4 years ago

I re-ran the code you provided. The mAP on Pascal VOC2007 is 42.2%, but your paper reports 51.5%.

Also, is the shot set to 200 in the first phase?

tonysy commented 4 years ago

Same problem. I get 47.5 mAP with the example in the README.

AP for bird = 0.415
AP for bus = 0.480
AP for cow = 0.033
AP for motorbike = 0.010
AP for sofa = 0.112
Mean AP = 0.4752

In your paper, the results for the first split are:

AP for bird = 0.525
AP for bus = 0.559
AP for cow = 0.527
AP for motorbike = 0.546
AP for sofa = 0.416
Mean AP = 0.515

The performance gap is huge. Please provide reproducible code to facilitate future research. Thanks.

Hxx2048 commented 4 years ago

Hello, my results:

AP for bird = 0.525
AP for bus = 0.400
AP for cow = 0.241
AP for mbike = 0.542
AP for sofa = 0.115

XiongweiWu commented 4 years ago

@Hxx2048 @tonysy On my side, I obtain higher results on 5-shot (50.0% vs 45.7%) and close results on 10-shot (51.1% vs 51.5%). I think the main reason is that the stage 2 training image sets are randomly generated, so the results may have some variance.

XiongweiWu commented 4 years ago

BTW, for 1-shot, the standard deviation on my side is extremely high, about 3% (from min 9.1% to max 22.3%, average 15.3%, reported 19.9%), which means the current problem setting, a single few-shot detection task, may not evaluate the model well. Maybe future work could evaluate on multiple tasks.

tonysy commented 4 years ago

If the randomness matters this much, it is difficult to verify the effectiveness of the method.

yanxp commented 4 years ago

Because the novel samples in phase 2 vary between runs, we evaluated the model five times and reported the average in our paper.

tonysy commented 4 years ago

@yanxp Could you share the sampled data split list (.pkl) to make the experiments reproducible?
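For illustration, here is a minimal sketch of how a k-shot split could be sampled with a fixed seed and saved to a .pkl file so that others can reuse the same images. This is not the repository's sampling code; `sample_k_shot`, `save_split`, `load_split`, and the toy image IDs are all hypothetical, and the layout of the authors' own .pkl file is not assumed.

```python
import pickle
import random
import tempfile
from pathlib import Path


def sample_k_shot(images_by_class, k, seed):
    """Pick k image IDs per class with a seeded RNG (hypothetical helper)."""
    rng = random.Random(seed)
    split = {}
    # Sort classes and IDs so the result does not depend on input ordering.
    for cls in sorted(images_by_class):
        split[cls] = sorted(rng.sample(sorted(images_by_class[cls]), k))
    return split


def save_split(split, path):
    """Write the sampled split to disk so it can be shared."""
    with open(path, "wb") as f:
        pickle.dump(split, f)


def load_split(path):
    """Read a previously saved split."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Toy image IDs, made up for this example only.
toy_images = {
    "bird": ["b01", "b02", "b03", "b04", "b05"],
    "sofa": ["s01", "s02", "s03", "s04", "s05"],
}

split = sample_k_shot(toy_images, k=3, seed=0)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_3shot.pkl"
    save_split(split, path)
    restored = load_split(path)

print(restored == split)
```

With the seed and the file both fixed, every run trains on the same novel samples, which removes the sampling step as a source of variance.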

Hxx2048 commented 4 years ago

I evaluated the model five times (10 shots):

| time | bird | bus | cow | mbike | sofa | mean |
|---|---|---|---|---|---|---|
| 1 | 53.4 | 41.2 | 24.1 | 53.6 | 15.9 | 37.6 |
| 2 | 32.2 | 40.9 | 17.4 | 0.37 | 18.7 | 22.6 |
| 3 | 41.2 | 12.9 | 41.4 | 29.7 | 12.9 | 27.6 |
| 4 | 49.7 | 52.1 | 51.0 | 33.9 | 11.5 | 39.6 |
| 5 | 52.5 | 40.0 | 24.1 | 54.2 | 11.5 | 36.5 |
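To put a number on the spread, the per-run means from the five 10-shot runs above can be summarized as follows. This is a small sketch using only the "mean" column as posted; the variable names are illustrative.

```python
from statistics import mean, stdev

# "mean" column of the five 10-shot runs, copied from the table above.
run_means = [37.6, 22.6, 27.6, 39.6, 36.5]

average = mean(run_means)
spread = stdev(run_means)  # sample standard deviation (n - 1 in the denominator)

print(f"average = {average:.1f}")   # average = 32.8
print(f"std dev = {spread:.1f}")    # std dev = 7.3
print(f"range   = {min(run_means)} to {max(run_means)}")
```

An average of about 32.8 with a standard deviation of about 7.3 over five runs is far from the 51.5 reported in the paper.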

YoungXIAO13 commented 4 years ago

> @Hxx2048 @tonysy On my side, I obtain higher results on 5-shot (50.0% vs 45.7%) and close results on 10-shot (51.1% vs 51.5%). I think the main reason is that the stage 2 training image sets are randomly generated, so the results may have some variance.

I think the main difference between the mAP of our re-implementations and that of the paper might not come from the randomness in data generation. If that's really the case as you said, two or three percent would still be acceptable, but ~40% vs 51.5% is not a small margin.

I've trained several times and got results similar to @Hxx2048's, which are far below the results reported in the paper.

Another thing to note is that the Mean AP printed on screen at test time is the mAP over all classes. I confused it with the mAP for novel classes at first.
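As a quick check, the novel-class mAP can be computed directly from the per-class APs that the test script prints. The sketch below uses the five novel-class APs tonysy posted earlier in this thread; `novel_map` is a hypothetical helper, not a function from the repository.

```python
from statistics import mean

# Novel classes of the first split, as listed in this thread.
NOVEL_CLASSES = ("bird", "bus", "cow", "motorbike", "sofa")


def novel_map(per_class_ap, novel_classes=NOVEL_CLASSES):
    """Average the AP over the novel classes only (hypothetical helper)."""
    return mean(per_class_ap[c] for c in novel_classes)


# Per-class APs from tonysy's run above (printed "Mean AP = 0.4752").
reported_ap = {
    "bird": 0.415,
    "bus": 0.480,
    "cow": 0.033,
    "motorbike": 0.010,
    "sofa": 0.112,
}

print(f"novel-class mAP = {novel_map(reported_ap):.3f}")  # novel-class mAP = 0.210
```

So the printed 0.4752 for that run corresponds to a novel-class mAP of only 0.210.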

XiongweiWu commented 4 years ago

@YoungXIAO13 From another reported issue, it seems the author has modified something in the data sampling that affects the final results. I ran the code early on, in mid-October, and it seems the code has been modified since.

Hxx2048 commented 4 years ago

From issue #19, "Results un-reproducible after your revision on Nov 6, 2019", I found the original implementation.

The changes are in lib/roi_data_layer/roidb.py, line 64, filter_class_roidb(). I changed the code back to the original version and got results close to the paper on 10-shot (51.1% vs 51.5%), like @XiongweiWu.

Results with the current code:

| time | bird | bus | cow | mbike | sofa | mean |
|---|---|---|---|---|---|---|
| 1 | 53.4 | 41.2 | 24.1 | 53.6 | 15.9 | 37.6 |
| 2 | 32.2 | 40.9 | 17.4 | 0.37 | 18.7 | 22.6 |
| 3 | 41.2 | 12.9 | 41.4 | 29.7 | 12.9 | 27.6 |
| 4 | 49.7 | 52.1 | 51.0 | 33.9 | 11.5 | 39.6 |
| 5 | 52.5 | 40.0 | 24.1 | 54.2 | 11.5 | 36.5 |

Results with the original code (I tried only once):

| time | bird | bus | cow | mbike | sofa | mean |
|---|---|---|---|---|---|---|
| 1 | 53.6 | 62.8 | 56.0 | 62.1 | 20.8 | 51.1 |

Results in the paper:

| bird | bus | cow | mbike | sofa | mean |
|---|---|---|---|---|---|
| 52.5 | 55.9 | 52.7 | 54.6 | 41.6 | 51.5 |

Although some categories differ noticeably from the paper (sofa: 20.8% vs 41.6%), the mean AP (51.1%) is close to the paper's result (51.5%).

As issue #16 says, there is a bug in the function filter_class_roidb(roidb, shot, imdb) at line 59 of MetaRCNN/lib/roi_data_layer/roidb.py in the original code, so the author changed the original code to the current version. The author thought the modified code was equivalent to the original, but evidently the paper's results cannot be reproduced with the newest version, as issue #19 says.
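The following sketch makes that comparison explicit, using the single original-code run and the paper's 10-shot numbers quoted in this comment. The variable names are illustrative.

```python
# Per-class AP (%) from the single run with the original code.
original_run = {"bird": 53.6, "bus": 62.8, "cow": 56.0, "mbike": 62.1, "sofa": 20.8}
# Per-class AP (%) reported in the paper for the same split.
paper = {"bird": 52.5, "bus": 55.9, "cow": 52.7, "mbike": 54.6, "sofa": 41.6}

run_mean = sum(original_run.values()) / len(original_run)
paper_mean = sum(paper.values()) / len(paper)

# Positive delta: the re-run is higher than the paper for that class.
deltas = {cls: round(original_run[cls] - paper[cls], 1) for cls in paper}
largest_gap = max(deltas, key=lambda cls: abs(deltas[cls]))

print(f"run mean = {run_mean:.1f}, paper mean = {paper_mean:.1f}")
print(deltas)
print(f"largest per-class gap: {largest_gap} ({deltas[largest_gap]:+.1f})")
```

The means differ by less than half a point, but sofa is 20.8 points lower while bus and mbike are about 7 points higher, so the close average hides large per-class differences.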

tonysy commented 4 years ago

@Hxx2048 Thanks for your efforts. From my perspective, the authors need to verify their method with the bug-fixed filter_class_roidb, report the numerical results, and provide the data split file (.pkl) for verification.

JeyesHan commented 4 years ago

My results on 10-shot:

AP for bird = 0.525
AP for bus = 0.158
AP for cow = 0.539
AP for motorbike = 0.454
AP for sofa = 0.029
Mean AP = 0.5197 (all classes)
Mean AP = 0.341 (novel classes)

It seems the previously reported bug, which the author didn't notice, boosted the results.

yanxp commented 4 years ago

Hello, we ran our original filter_class_roidb code and got 51.7 on 10-shot. I have uploaded the original filter_class_roidb implementation from our code. The 10-shot.pkl file contains the sampled shots for the second phase, and attention.pkl contains the 10-shot class attention vectors. There is indeed variance between runs, and we also ran the code a few times with sampled k-shots to obtain the results in our paper. Thanks!

zr526799544 commented 3 years ago

> Same problem. I get 47.5 mAP with the example in the README.
>
> AP for bird = 0.415
> AP for bus = 0.480
> AP for cow = 0.033
> AP for motorbike = 0.010
> AP for sofa = 0.112
> Mean AP = 0.4752
>
> In your paper, the results for the first split are:
>
> AP for bird = 0.525
> AP for bus = 0.559
> AP for cow = 0.527
> AP for motorbike = 0.546
> AP for sofa = 0.416
> Mean AP = 0.515
>
> The performance gap is huge. Please provide reproducible code to facilitate future research. Thanks.

Could you tell me how to build the dataset to rerun the code?

yanxp / MetaR-CNN

The results are too low!!! #27