zylo117 / Yet-Another-EfficientDet-Pytorch

The pytorch re-implement of the official efficientdet with SOTA performance in real time and pretrained weights.
GNU Lesser General Public License v3.0
5.2k stars 1.27k forks

Range of predicted labels when using published checkpoints + coherence with training code #212

Closed ggaziv closed 4 years ago

ggaziv commented 4 years ago

I implemented a VOC-style mAP evaluation (following toandaominh1997/EfficientDet.Pytorch).

It then appears that the empty class names in coco.yml mix things up and corrupt the AP for many classes when using the published checkpoint:

mAP:
person: 0.7163819236158591
bicycle: 0.5161784001945379
car: 0.4937064741075016
motorcycle: 0.6830735390930684
airplane: 0.9021500970078673
bus: 0.7359536147712323
train: 0.8431693275922107
truck: 0.42524955112648233
boat: 0.41918798074727315
traffic light: 0.36277670804285234
fire hydrant: 0.7571938297429218
stop sign: 0.0
parking meter: 0.00011820330969267139
bench: 0.0
bird: 0.0
cat: 5.500550055005501e-05
dog: 0.0008903918908175638
horse: 0.00015318627450980392
sheep: 1.326224768573778e-05
cow: 0.001399473579257586
elephant: 4.360718646432932e-05
bear: 0.0
zebra: 0.0
giraffe: 8.264091162984444e-05
backpack: 0.0
umbrella: 0.0
handbag: 0.0061549167777084305
tie: 0.0
suitcase: 0.0
frisbee: 0.0
skis: 0.0
snowboard: 0.0
sports ball: 0.0
kite: 2.569835273558965e-05
baseball bat: 0.00015644635054589055
baseball glove: 9.792401096748924e-05
skateboard: 0.0
surfboard: 6.620247830476727e-05
tennis racket: 0.000867401126918968
bottle: 1.0557934023470287e-06
wine glass: 0.0
cup: 0.0
fork: 0.0
knife: 2.1036514957415183e-05
spoon: 0.0
bowl: 3.3721353710023335e-06
banana: 2.9897153790959103e-06
apple: 0.0
sandwich: 7.5029449058755554e-06
orange: 0.0
broccoli: 1.8853695324283557e-05
carrot: 0.0
hot dog: 0.0
pizza: 2.750880281690141e-05
donut: 6.405001024800164e-06
cake: 9.138261902586128e-06
chair: 0.0
couch: 0.0
potted plant: 0.0
bed: 0.0
dining table: 0.0005959339836060795
toilet: 1.0700234014117889e-06
tv: 1.0458500669344042e-05
laptop: 0.0
mouse: 0.0
remote: 0.0
keyboard: 2.0118007314362196e-05
cell phone: 0.0
microwave: 0.0
oven: 0.0
toaster: 0.0
sink: 0.0
refrigerator: 0.0
book: 5.335780678071009e-06
clock: 0.00012022013210209856
vase: 0.0
scissors: 0.0
teddy bear: 0.0
hair drier: 0.0
toothbrush: 0.0
avg mAP: 0.08582496006665566

I fixed this easily by finetuning from that checkpoint just a little and evaluating again:

mAP:
person: 0.6705238636559865
bicycle: 0.48276996481959494
car: 0.43395402235268116
motorcycle: 0.6533560449532532
airplane: 0.7128173767673481
bus: 0.7141445087586875
train: 0.8085567397644697
truck: 0.41230201211796785
boat: 0.3726112619999363
traffic light: 0.2731919843158005
fire hydrant: 0.7686382288007776
stop sign: 0.6474742058282337
parking meter: 0.5462075288808406
bench: 0.27220704698951637
bird: 0.36663957322572927
cat: 0.7978577523724735
dog: 0.6520410576238584
horse: 0.7223033272252359
sheep: 0.6231026395744703
cow: 0.5675314836052323
elephant: 0.8308615250022378
bear: 0.7626138912906535
zebra: 0.8652572583556355
giraffe: 0.8387080957165396
backpack: 0.17268088035649581
umbrella: 0.5173319656205155
handbag: 0.18959284603089657
tie: 0.34610891895078344
suitcase: 0.4711031956138419
frisbee: 0.643777824063373
skis: 0.2841603887725633
snowboard: 0.31219270127573295
sports ball: 0.24030119226715077
kite: 0.38279276633555315
baseball bat: 0.46739378784281427
baseball glove: 0.4248652329262107
skateboard: 0.6751596679648146
surfboard: 0.4925158572887858
tennis racket: 0.7012750305011906
bottle: 0.4026828283265537
wine glass: 0.3610220874776069
cup: 0.4145329934314569
fork: 0.3363045641607002
knife: 0.15031093806040377
spoon: 0.1471982982807306
bowl: 0.4178757844532799
banana: 0.23026322110229352
apple: 0.10102389162592623
sandwich: 0.17512185453756932
orange: 0.1413954887857534
broccoli: 0.207151473395644
carrot: 0.02735277104737069
hot dog: 0.16107822366635616
pizza: 0.15717480778803072
donut: 0.3059019886399552
cake: 0.3536832170912464
chair: 0.39839342184055726
couch: 0.5042983818750569
potted plant: 0.38410903082874404
bed: 0.5848806895847063
dining table: 0.3658962213504868
toilet: 0.8105900184619292
tv: 0.6328853719893284
laptop: 0.6730282208613219
mouse: 0.6936599915898881
remote: 0.20275091537742887
keyboard: 0.613134167024717
cell phone: 0.32889939589020045
microwave: 0.6129961903366123
oven: 0.5363460680729832
toaster: 0.0
sink: 0.5227943370975559
refrigerator: 0.6664909528613769
book: 0.15523130692771275
clock: 0.6366486162544338
vase: 0.3814498727978479
scissors: 0.40205697773382104
teddy bear: 0.5515679926141651
hair drier: 0.018181818181818184
toothbrush: 0.24811622625135288
avg mAP: 0.4516425533435351

I think this suggests that the published checkpoint does not predict labels in the range 0-79 (which is what CocoDataset produces), but in the wider range given by the length of obj_list in coco.yml.
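For context, the two ranges differ because COCO's category ids are non-contiguous: the 80 real classes carry ids 1-90 with ten gaps, so a contiguous 0-79 label space needs an explicit mapping. A minimal sketch (not code from this repo; the missing-id set is the standard COCO 2017 one):

```python
# The 10 category ids absent from COCO 2017's 1-90 id range.
MISSING_COCO_IDS = {12, 26, 29, 30, 45, 66, 68, 69, 71, 83}

# The 80 ids that actually occur in annotations.
coco_ids = [i for i in range(1, 91) if i not in MISSING_COCO_IDS]
assert len(coco_ids) == 80

# coco category_id -> contiguous 0-79 training label, and the inverse.
coco_to_label = {cid: idx for idx, cid in enumerate(coco_ids)}
label_to_coco = {idx: cid for cid, idx in coco_to_label.items()}
```

A checkpoint trained against the blank-padded 0-89 space will therefore emit shifted labels when evaluated with this compacted 0-79 mapping, which matches the near-zero APs above.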

Could this be clarified?

zylo117 commented 4 years ago

I scanned through his code and found that the threshold is 0.4 by default, which is higher than the recommended 0.05, so of course the mAP is lower. But I saw there are a lot of zeros. That makes no sense; I think your implementation slightly changes something.

Here's a suggestion: debug the evaluation on the same image with the different scripts and see whether their results are consistent.

ggaziv commented 4 years ago

From what I see, the default value you are referring to is not wired to anything, and the values are as you mention (here). Also, I made substantial changes to fix it in your code.

It still raises something important to understand about your code: you set the number of classes/FC outputs to len(obj_list) (which is 90 labels), but CocoDataset does a coco_label->label mapping, i.e., the label values fed to the model are 0-79. That means you have additional dummy units for 80-89.

On the other hand, if that were the case, then finetuning your coco checkpoint as I did would not have changed anything. But as posted above, it did! So it seems that the published checkpoint does not fit the current training code; otherwise, why the drastic change when finetuning on the same dataset?

Supporting this, check out these validation plots from finetuning the coco checkpoint (on CocoDataset). They show a brief improvement in classification (and not regression). I think it corresponds exactly to this correction of the labeling range. (validation plots attached)

zylo117 commented 4 years ago

I'm confused here.

  1. What does finetuning have to do with evaluation?
  2. You don't need a mapping when you fill the empty classes with blanks. It will not affect mAP unless something is falsely predicted as a blank class. That's a risk, but the chance is quite small.
  3. Of course the training code won't fit the pretrained weights. It is for finetuning from the backbone or training on a custom dataset, not for a fully trained set of weights that can hardly be improved further.
  4. Also, check out here. His model is initialized with the threshold 0.4 by default, if you didn't change it, it is 0.4. https://github.com/toandaominh1997/EfficientDet.Pytorch/blob/fbe56e58c9a2749520303d2d380427e5f01305ba/eval.py#L372 https://github.com/toandaominh1997/EfficientDet.Pytorch/blob/fbe56e58c9a2749520303d2d380427e5f01305ba/models/efficientdet.py#L73
  5. Regardless of what others do, is there a bug in coco_eval.py? Can you please point it out if there is one?
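The blank-filling approach in point 2 can be sketched like this (a hypothetical illustration; the placeholder class names are not the real obj_list entries): pad the list with empty strings at the category ids COCO skips, so model output index i lines up with category_id i + 1 and no coco_label->label remapping is needed.

```python
# Category ids missing from COCO 2017's 1-90 range.
MISSING_COCO_IDS = {12, 26, 29, 30, 45, 66, 68, 69, 71, 83}

# Placeholder names standing in for the 80 real class names.
real_names = iter(f'class_{i}' for i in range(80))

# 90-entry class list: blank string wherever COCO has no category.
obj_list = ['' if cid in MISSING_COCO_IDS else next(real_names)
            for cid in range(1, 91)]

assert len(obj_list) == 90
assert obj_list[11] == ''        # category_id 12 is a blank slot
assert obj_list[0] == 'class_0'  # category_id 1 -> output index 0
```

With this layout the blank slots are dummy outputs; as noted above, they only hurt mAP if the model falsely predicts a blank class.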
ggaziv commented 4 years ago

Finetuning is a way to adjust the weights to a new labeling of a dataset. Clearly, if I finetune on corrupted labeling, it will be reflected in evaluation. In the same way, if I fix the labeling and finetune, I can fix poor evaluation.

Here I performed the naive experiment of finetuning on the same coco dataset (without changing the labeling) and expected to find no impact on evaluation, but this wasn't the case (see above). Why doesn't the training code fit the pretrained weights? Wasn't it used to create them?

Also, I found your new fix for coco category id mismatch - could this be related?

You are right on point (4), but I am not considering his model, only yours (I am not using his code as-is):

model = EfficientDetBackbone(compound_coef=compound_coef, num_classes=len(obj_list),
                             ratios=eval(params['anchors_ratios']), scales=eval(params['anchors_scales']))
zylo117 commented 4 years ago

yes, the coco class id mismatch is fixed, but it won't affect anything except training on coco

ggaziv commented 4 years ago

Great. So can I now expect that the training code fits the pretrained weights? Namely, finetuning on coco from the pretrained weights should not significantly alter evaluation, correct?

zylo117 commented 4 years ago

yes, if the lr is low enough.

ggaziv commented 4 years ago

It was only fixed when I set annotation[0, 4] = a['category_id'] - 1 in https://github.com/zylo117/Yet-Another-EfficientDet-Pytorch/blob/3716ff4a133ea36359ef7ec088fd0968335fd9a7/efficientdet/dataset.py#L76
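A minimal sketch of what that one-line change does, assuming the blank-padded 90-entry obj_list (the helper function and the shape of the annotation array here are illustrative, not the repo's exact code): with blanks filling the gaps, the model label is simply category_id - 1 rather than a compacted 0-79 index.

```python
import numpy as np

def annotation_to_label(a):
    """Build a single [x1, y1, x2, y2, label] row from a COCO annotation dict."""
    annotation = np.zeros((1, 5), dtype=np.float32)
    # columns 0-3 would hold the box; column 4 holds the class label
    annotation[0, 4] = a['category_id'] - 1
    return annotation

# category_id 12 (a blank slot in the padded list) -> label 11
ann = annotation_to_label({'category_id': 12})
```

This keeps dataset labels aligned with the 90-way classification head, so the pretrained checkpoint and the training code finally agree on the label range.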

zylo117 commented 4 years ago

fixed, thanks