zh460045050 / DA-WSOL_CVPR2022

Official implementation of the paper "Weakly Supervised Object Localization as Domain Adaption"

Wrong evaluation caused by libs/inference.py #2

Open dongjunhwang opened 2 years ago

dongjunhwang commented 2 years ago

Sorry for the several issues I have opened, and thank you for your quick replies.

Referring to the wsolevaluation code, I think libs/inference.py in this repository has a problem with code indentation.

The BoxEvaluator and MaskEvaluator classes have an accumulate function that computes the localization score for just a single class activation map, and a compute function that computes the total score over all the localization scores previously gathered by accumulate.

So the accumulate call must be placed inside the for loop, because it handles only the single class activation map generated from one image.
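Just to make the intended call pattern concrete, here is a small toy example (not the real BoxEvaluator/MaskEvaluator from wsolevaluation, only a stand-in with the same accumulate/compute shape and made-up names):

class ToyEvaluator:
    # Stand-in for BoxEvaluator/MaskEvaluator: accumulate() records one
    # per-image score, compute() aggregates everything recorded so far.
    def __init__(self):
        self.scores = []

    def accumulate(self, score, image_id):
        # called once per image, inside the data loop
        self.scores.append(score)

    def compute(self):
        # called once after the loop, over all accumulated images
        return sum(self.scores) / len(self.scores)

evaluator = ToyEvaluator()
for image_id, score in [("img_0", 0.7), ("img_1", 0.5), ("img_2", 0.9)]:
    evaluator.accumulate(score, image_id)   # per image, inside the loop

print(evaluator.compute())                  # 0.7, averaged over all three images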

But in libs/inference.py, the accumulate call sits outside the for loop, like this:

for images, targets, image_ids in self.loader:
    ....
    for cam, target, predict, image, image_id in zip(cams, targets, predicts, images, image_ids):
        cam_resized = cv2.resize(cam, image_size,
                                 interpolation=cv2.INTER_CUBIC)
        ....

    # part that must be fixed
    performance = {}
    if self.dataset_name == "ILSVRC":
        self.evaluator_boxes.accumulate(cam_normalized, image_id)
        gt_known = self.evaluator_boxes.compute()
        top_1 = self.evaluator_boxes.compute_top1()
        performance['gt_known'] = gt_known
        performance['top_1'] = top_1

    elif self.dataset_name == "CUB":
        if self.split == "test":
            self.evaluator_mask.accumulate(cam_normalized, image_id)
            pxap, iou = self.evaluator_mask.compute()
            performance['pxap'] = pxap
            performance['iou'] = iou

        self.evaluator_boxes.accumulate(cam_normalized, image_id)
        gt_known = self.evaluator_boxes.compute()
        top_1 = self.evaluator_boxes.compute_top1()
        performance['gt_known'] = gt_known
        performance['top_1'] = top_1

    else:
        self.evaluator_mask.accumulate(cam_normalized, image_id)
        pxap, iou = self.evaluator_mask.compute()
        performance['pxap'] = pxap
        performance['iou'] = iou

return performance

So I think the code should be changed as below. Is this the right fix for inference.py?

for images, targets, image_ids in self.loader:
    ....
    for cam, target, predict, image, image_id in zip(cams, targets, predicts, images, image_ids):
        cam_resized = cv2.resize(cam, image_size,
                                 interpolation=cv2.INTER_CUBIC)
        ....
        if self.dataset_name == "ILSVRC":
            self.evaluator_boxes.accumulate(cam_normalized, image_id)

        elif self.dataset_name == "CUB":
            if self.split == "test":
                self.evaluator_mask.accumulate(cam_normalized, image_id)
            self.evaluator_boxes.accumulate(cam_normalized, image_id)

        else:
            self.evaluator_mask.accumulate(cam_normalized, image_id)

performance = {}
if self.dataset_name == "ILSVRC":
    gt_known = self.evaluator_boxes.compute()
    top_1 = self.evaluator_boxes.compute_top1()
    performance['gt_known'] = gt_known
    performance['top_1'] = top_1

elif self.dataset_name == "CUB":
    if self.split == "test":
        pxap, iou = self.evaluator_mask.compute()
        performance['pxap'] = pxap
        performance['iou'] = iou

    gt_known = self.evaluator_boxes.compute()
    top_1 = self.evaluator_boxes.compute_top1()
    performance['gt_known'] = gt_known
    performance['top_1'] = top_1

else:
    pxap, iou = self.evaluator_mask.compute()
    performance['pxap'] = pxap
    performance['iou'] = iou

return performance

Another important point is that this change affects the reported score of the method, because the current code only evaluates the last image, while the fixed code evaluates all of the images in the dataset.
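This happens simply because Python loop variables keep their values after the loop finishes, so an accumulate call placed after a loop only sees whatever cam_normalized and image_id were left over from the final iteration. A tiny self-contained illustration (toy names only):

image_ids = ["img_0", "img_1", "img_2"]
accumulated = []
for image_id in image_ids:
    pass  # per-image work would happen here

# runs once, with the value left over from the last iteration
accumulated.append(image_id)
print(accumulated)  # ['img_2'] -- only the last image gets recorded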

For example, CUB (ResNet backbone, using checkpoint #1) scores 71.35 with the current code, but the fixed code gives 69.87.

Thanks.

zh460045050 commented 2 years ago

Thank you for reporting this crucial issue. We have already revised it and retested all our checkpoints. Fortunately, it does not significantly affect the evaluation scores. The corresponding results will be updated soon.

zh460045050 commented 2 years ago

We have now updated the scores of our models on these datasets. Thank you again for reporting this issue.