Zhang-HM opened this issue 4 years ago
Same problem; I get 47.5 mAP with the example in the README:
AP for bird = 0.415
AP for bus = 0.480
AP for cow = 0.033
AP for motorbike = 0.010
AP for sofa = 0.112
Mean AP = 0.4752
In your paper, the results for the first split are:
AP for bird = 0.525
AP for bus = 0.559
AP for cow = 0.527
AP for motorbike = 0.546
AP for sofa = 0.416
Mean AP = 0.515
The performance gap is huge; please provide reproducible code to facilitate future research. Thanks.
Hello, my results:
AP for bird = 0.525
AP for bus = 0.400
AP for cow = 0.241
AP for mbike = 0.542
AP for sofa = 0.115
@Hxx2048 @tonysy On my side, I obtain higher results on 5-shot (50.0% vs 45.7%) and close results on 10-shot (51.1% vs 51.5%). I suspect the main reason is that the training image sets of stage 2 are randomly generated, so the results have some variance.
BTW, for 1-shot, the standard deviation on my side is extremely high, about 3% (min 9.1%, max 22.3%, average 15.3%, reported 19.9%), which suggests that evaluating on a single few-shot detection task may not evaluate the model well. Maybe future work should evaluate over multiple tasks.
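One way to at least make the variance inspectable is to fix the sampling seed and save the split to disk. Below is a minimal sketch with a hypothetical `image_ids_per_class` mapping; these names are my own and not the repo's actual data structures:

```python
import pickle
import random

def sample_and_save_split(image_ids_per_class, k, seed, path):
    """Sample k images per novel class with a fixed seed and pickle the split.

    Hypothetical helper, not from the MetaRCNN repo: image_ids_per_class
    maps each novel class name to the ids of all images containing it.
    """
    rng = random.Random(seed)                 # fixed seed -> same split every run
    split = {cls: rng.sample(ids, k)          # k distinct images per class
             for cls, ids in image_ids_per_class.items()}
    with open(path, 'wb') as f:
        pickle.dump(split, f)                 # share this file to compare runs
    return split
```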
If the randomness matters a lot, it's difficult to verify the effectiveness of the method.
As there is variance in the novel samples of phase 2, we evaluate the model five times and report the average results in our paper.
@yanxp Would you like to share the sampled data split (.pkl) list to make the experiments reproducible?
I evaluated the model five times (10 shots):

| run | bird | bus | cow | mbike | sofa | mean |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 53.4 | 41.2 | 24.1 | 53.6 | 15.9 | 37.6 |
| 2 | 32.2 | 40.9 | 17.4 | 0.37 | 18.7 | 22.6 |
| 3 | 41.2 | 12.9 | 41.4 | 29.7 | 12.9 | 27.6 |
| 4 | 49.7 | 52.1 | 51.0 | 33.9 | 11.5 | 39.6 |
| 5 | 52.5 | 40.0 | 24.1 | 54.2 | 11.5 | 36.5 |
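A quick sanity check on the spread of these runs (plain numpy, numbers copied from the table above; note that row 2 recomputes to ~21.9 rather than 22.6, so the 0.37 mbike entry may be a typo for 3.7):

```python
import numpy as np

# Per-class AP for each of the five runs, copied from the table above.
runs = np.array([
    [53.4, 41.2, 24.1, 53.6, 15.9],
    [32.2, 40.9, 17.4,  0.37, 18.7],
    [41.2, 12.9, 41.4, 29.7, 12.9],
    [49.7, 52.1, 51.0, 33.9, 11.5],
    [52.5, 40.0, 24.1, 54.2, 11.5],
])
per_run_map = runs.mean(axis=1)   # novel-class mAP of each run
print(per_run_map.round(1))       # [37.6 21.9 27.6 39.6 36.5]
print(per_run_map.mean().round(1), per_run_map.std().round(1))  # ~32.7 +/- 6.8
```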
I think the main difference between the mAP of our re-implementations and that of the paper might not come from randomness in data generation. If that were really the cause, as you said, a gap of two or three points would still be acceptable, but ~40% vs 51.5% is not a small margin.
I've trained several times and got results similar to @Hxx2048's, which show a large gap from the results reported in the paper.
Another thing to note: the Mean AP printed on screen at test time is the mAP over all classes; I confused it with the mAP over the novel classes at first.
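To make the distinction concrete: the novel-class APs below are the ones from the first comment in this thread, while the 15 base-class APs are invented placeholders, since only the two averaging conventions matter here:

```python
# The five novel classes of the first VOC split; APs copied from the
# first comment above. The 15 base-class APs are invented placeholders.
novel_aps = {'bird': 0.415, 'bus': 0.480, 'cow': 0.033,
             'motorbike': 0.010, 'sofa': 0.112}
base_classes = ['aeroplane', 'bicycle', 'boat', 'bottle', 'car', 'cat',
                'chair', 'diningtable', 'dog', 'horse', 'person',
                'pottedplant', 'sheep', 'train', 'tvmonitor']
base_aps = {c: 0.55 for c in base_classes}    # made-up base-class APs

all_aps = {**base_aps, **novel_aps}
map_all = sum(all_aps.values()) / len(all_aps)
map_novel = sum(novel_aps.values()) / len(novel_aps)
print(f'all-class mAP:   {map_all:.3f}')    # 0.465 -- what the script prints
print(f'novel-class mAP: {map_novel:.3f}')  # 0.210 -- what matters for few-shot
```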
@YoungXIAO13 From another reported issue, it seems the author has modified something about the data sampling that impacts the final results. I ran the code very early, in mid-October, and it appears the code has since been modified.
From issue #19, "Results un-reproducible after your revision on Nov 6, 2019", I found the original implementation code.
The change is in lib/roi_data_layer/roidb.py, line 64, filter_class_roidb(). I reverted the code to the original version and, like @XiongweiWu, got results close to the paper on 10-shot (51.1% vs 51.5%).

Results for the current code:

| run | bird | bus | cow | mbike | sofa | mean |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 53.4 | 41.2 | 24.1 | 53.6 | 15.9 | 37.6 |
| 2 | 32.2 | 40.9 | 17.4 | 0.37 | 18.7 | 22.6 |
| 3 | 41.2 | 12.9 | 41.4 | 29.7 | 12.9 | 27.6 |
| 4 | 49.7 | 52.1 | 51.0 | 33.9 | 11.5 | 39.6 |
| 5 | 52.5 | 40.0 | 24.1 | 54.2 | 11.5 | 36.5 |
Results for the original code (I only tried it once):

| run | bird | bus | cow | mbike | sofa | mean |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 53.6 | 62.8 | 56.0 | 62.1 | 20.8 | 51.1 |
Results in the paper:

| bird | bus | cow | mbike | sofa | mean |
| --- | --- | --- | --- | --- | --- |
| 52.5 | 55.9 | 52.7 | 54.6 | 41.6 | 51.5 |
Although some categories differ quite a bit from the paper (20.8% vs 41.6% on sofa), the mean AP (51.1%) is close to the paper's (51.5%).
As issue #16 said, there was a bug in the function filter_class_roidb(roidb, shot, imdb) at line 59 of MetaRCNN/lib/roi_data_layer/roidb.py in the original code, so the author changed the original code to the current version. The author thought the original code was equivalent to the modified one, but the paper's results are evidently un-reproducible with the newest version, as issue #19 said.
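I can't reproduce the exact original function here, but a rough sketch of what any k-shot roidb filter must do (field names are guesses, not MetaRCNN's actual code) shows where subtle behavioral differences creep in:

```python
from collections import defaultdict

def filter_class_roidb_sketch(roidb, shot):
    """Illustrative sketch only; the real filter_class_roidb differs.

    Assumes each roidb entry is a dict with a 'gt_classes' iterable of
    class ids. The subtle part: an image can contain several classes, so
    counting per-image vs per-instance, and the order in which images are
    visited, change which annotations survive -- exactly the kind of
    detail that can silently shift the final mAP.
    """
    kept, counts = [], defaultdict(int)
    for entry in roidb:
        classes = set(entry['gt_classes'])
        # keep the image only if every class in it still needs shots
        if all(counts[c] < shot for c in classes):
            kept.append(entry)
            for c in classes:
                counts[c] += 1
    return kept
```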
@Hxx2048 Thanks for your efforts. From my perspective, the author needs to verify the method with the bug-fixed filter_class_roidb, report the numerical results, and provide the data split file (.pkl) for verification.
My result on 10-shot:
AP for bird = 0.525
AP for bus = 0.158
AP for cow = 0.539
AP for motorbike = 0.454
AP for sofa = 0.029
Mean AP = 0.5197 (all classes)
Mean AP = 0.341 (novel classes)
It seems the previously reported bug, which the author didn't notice, boosted the results.
Hello, we ran our original filter_class_roidb code and got 51.7 on 10-shot. I have uploaded the original filter_class_roidb implementation of our code. The 10-shot.pkl contains the sampled shots of the second phase, and the attention.pkl contains the 10-shot class attention vectors. There is indeed variance when running the code, and we also ran it a few times with different sampled k-shots to get the results in our paper. Thanks!
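For anyone verifying against the uploaded files, a quick way to inspect them (assuming they are plain pickles; the internal structure is whatever the author serialized):

```python
import pickle

# The internal structure of these files is unknown to me, so this just
# loads them and reports the top-level type/keys.
with open('10-shot.pkl', 'rb') as f:
    split = pickle.load(f)
print(type(split))
if hasattr(split, 'keys'):
    print(list(split.keys()))     # e.g. the sampled images per class

with open('attention.pkl', 'rb') as f:
    attention = pickle.load(f)    # per the author: 10-shot class attention vectors
print(type(attention))
```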
Could you tell me how to build the dataset to rerun the code?
I re-ran the code you provided; the mAP on Pascal VOC 2007 is 42.2%, but 51.5% in your paper.
And in the first phase, is the shot set to 200?