justin-hpcnt opened this issue 4 years ago
Hello,
Yes, I used the same repo and pre-trained weight for evaluation.
For the original DeepLab v2 evaluation, they use the original size of the images and labels from the dataset. In our evaluation, we use the size of the generated images (256x256) as the input size to the DeepLab model. The label maps are resized to 256x256 with nearest-neighbor interpolation to match the size of the generated images. This is a difference between our evaluation and the original DeepLab v2 evaluation, and it might be why our score is slightly higher.
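Concretely, that resize step might look like this (a minimal sketch using PIL; the actual preprocessing code in the repo may differ):

```python
# Sketch: resize a Cityscapes-style label map to the generated-image
# size (256x256) with nearest-neighbor interpolation, which keeps the
# pixel values as valid class indices instead of blending them.
import numpy as np
from PIL import Image

label = np.random.randint(0, 19, (1024, 2048), dtype=np.uint8)  # dummy label map
label_img = Image.fromarray(label)
label_256 = np.array(label_img.resize((256, 256), resample=Image.NEAREST))

print(label_256.shape)  # (256, 256)
```

Bilinear interpolation would be wrong here: averaging class indices produces meaningless in-between labels.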
Moreover, I think it can happen that the scores on generated images are slightly higher than the scores of the same model on real images. Because of label noise in the evaluation set, the real images may not be strictly aligned with their label maps. The generated images, however, are generated directly from the label maps, so there is no such label-noise issue. Higher mIoU and pixel accuracy scores only mean that the images align better with the ground-truth segmentation maps, not that they are more realistic than the real images.
Hi @xh-liu, I have a question about the quantitative evaluation. For the Cityscapes dataset, I ran the evaluation scripts and got mIoU=65.1, accuracy=93.9, FID=53.53. These differ somewhat from the results in your paper; in particular, the accuracy is clearly higher than your 82.3. Did you use any special settings in the segmentation scripts?
@ZzzackChen That's weird. I just tested the model again and it's still 82.3 pixel accuracy. I use the model and code from https://github.com/fyu/drn. The calculation of pixel accuracy is not provided in that code. How did you implement it?
@xh-liu
import numpy as np

# `hist` is the per-class confusion matrix (rows: ground truth, columns: prediction)
# Overall pixel accuracy
acc = np.diag(hist).sum() / (hist.sum() + 1e-12)
# Per-class accuracy
cl_acc = np.diag(hist) / (hist.sum(1) + 1e-12)
# Per-class IoU
iu = np.diag(hist) / (hist.sum(1) + hist.sum(0) - np.diag(hist) + 1e-12)
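For context, `hist` above is the confusion matrix accumulated over all evaluated pixels. In the DRN codebase it is built with a `fast_hist`-style helper roughly like this (a sketch; exact details may differ):

```python
import numpy as np

def fast_hist(pred, label, n):
    """n x n confusion matrix: rows are ground-truth classes, columns
    are predicted classes. Pixels whose label falls outside [0, n)
    (e.g. the 255 ignore label) are masked out."""
    k = (label >= 0) & (label < n)
    return np.bincount(n * label[k].astype(int) + pred[k],
                       minlength=n ** 2).reshape(n, n)

# Toy example: 3 of 4 pixels correct over 2 classes.
hist = fast_hist(np.array([0, 1, 1, 1]), np.array([0, 1, 0, 1]), 2)
acc = np.diag(hist).sum() / (hist.sum() + 1e-12)
print(round(acc, 2))  # 0.75
```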
@xh-liu Thanks a lot! Now I can reproduce results :D
@ZzzackChen If you ignore 255 labels, the result will be 93, as you calculated. If you count 255, the result will be 82.3. To stay consistent with the SPADE paper (https://arxiv.org/pdf/1903.07291.pdf), I chose the second calculation method for the Cityscapes dataset. For the COCO-Stuff and ADE datasets, pixel accuracy calculation is included in the evaluation code, and I used the calculation method from the original code.
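The two conventions can be illustrated with a toy sketch (255 is the Cityscapes ignore label; the model never predicts 255, so counting those pixels can only add errors and lower the accuracy):

```python
import numpy as np

pred = np.array([0, 1, 2, 2])    # model output (never 255)
gt   = np.array([0, 1, 2, 255])  # ground truth with one unlabeled pixel

# Method 1: ignore 255 -- accuracy over labeled pixels only.
valid = gt != 255
acc_ignore = (pred[valid] == gt[valid]).mean()
print(acc_ignore)  # 1.0

# Method 2: count 255 -- unlabeled pixels count as errors (SPADE convention).
acc_count = (pred == gt).mean()
print(acc_count)  # 0.75
```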
@xh-liu Thank you! Now I get it!
Hi, I found that in the original paper the FID for the Cityscapes dataset is 71.8 instead of the 53.53 you report. What explains this weird result?
@wjbKimberly The FID for Cityscapes reported in our paper is 54.3. 71.8 is the FID score reported in the SPADE paper (https://arxiv.org/abs/1903.07291).
@justin-hpcnt Do you know how to train on 8 GPUs? Thanks a lot.
@xh-liu How do you count 255 in the result when using the second (DRN) calculation method for the Cityscapes dataset? Thanks.
Hi,
Thank you for sharing the code and replying to my previous question! While reproducing the metrics, I have some questions:
I'm referring to the SPADE issue to implement the evaluation code. Did you use the same repo and pre-trained weights for evaluation?
If so, regarding the COCO-Stuff dataset, the original DeepLab v2 reaches 66.8 pixel accuracy and 39.1 mIoU on ground-truth validation images, yet CC-FPSE reaches 70.7 pixel accuracy and 41.6 mIoU, which seems odd. I think the difference might come from the input size fed to the DeepLab model. How did you feed inputs to the DeepLab network? (For example, did you use the 256x256 image directly, or upsample the 256x256 image to 321x321 with bilinear interpolation?)
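For reference, the two candidate pipelines in the parenthesis could be sketched like this (PIL used for illustration; which one matches the paper's numbers is exactly what I'm asking):

```python
import numpy as np
from PIL import Image

# Dummy stand-in for a 256x256 generated image.
img = Image.fromarray((np.random.rand(256, 256, 3) * 255).astype(np.uint8))

# Option A: feed the 256x256 generated image to DeepLab directly.
inp_a = np.array(img)                                              # (256, 256, 3)

# Option B: bilinearly upsample to DeepLab v2's 321x321 crop size first.
inp_b = np.array(img.resize((321, 321), resample=Image.BILINEAR))  # (321, 321, 3)
```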