Discrepancies in CLIPIQA and MUSIQ Scores When Testing ResShift on RealSR65

Guaishou74851 commented 4 months ago

Hi @zsyOAOA,

My evaluation metrics for ResShift on the RealSet65 dataset do not match the values reported in the paper. My procedure and the results I obtained are described below:

Data Verification and Command Execution:
- Confirmed the presence of the dataset in ./testdata/RealSet65.
- Ran the ResShift inference using the following command:
```
CUDA_VISIBLE_DEVICES=0 python inference_resshift.py -i testdata/RealSet65 -o result/RealSet65 --scale 4 --task realsrx4 --chop_size 512
```
Evaluation Metrics Assessment:
- Utilized IQA-PyTorch for computing CLIPIQA and MUSIQ metrics.
- Obtained the following results for the RealSet65 dataset:
```
CLIPIQA: 0.6418642669916153 (expected 0.6537)
MUSIQ: 58.211212921142575 (expected 61.330)
```
- Additionally, I observed these results for another subset of RealSR:
```
CLIPIQA: 0.5409876523911953 (expected 0.5958)
MUSIQ: 53.28555391311645 (expected 59.873)
```
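For reference, a minimal sketch of how folder-level scores like these can be computed with IQA-PyTorch. It assumes the `pyiqa` package is installed, that the metric names `clipiqa` and `musiq` are used, and that the dataset score is the plain mean of per-image scores; the helper names (`list_images`, `average_score`, `make_pyiqa_scorer`) are hypothetical and not part of the ResShift repo.

```python
from pathlib import Path
from statistics import mean

# Assumption: results are saved as ordinary image files with these extensions.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".bmp"}


def list_images(folder):
    """Return the image files directly inside `folder`, sorted by path."""
    return sorted(
        p for p in Path(folder).iterdir() if p.suffix.lower() in IMAGE_SUFFIXES
    )


def average_score(image_paths, score_fn):
    """Average a per-image no-reference score over a list of image paths."""
    scores = [float(score_fn(str(p))) for p in image_paths]
    if not scores:
        raise ValueError("no images to score")
    return mean(scores)


def make_pyiqa_scorer(metric_name, device="cuda"):
    """Build a per-image scorer backed by IQA-PyTorch (imported lazily)."""
    import pyiqa  # pip install pyiqa

    metric = pyiqa.create_metric(metric_name, device=device)
    # pyiqa metrics accept an image path and return a one-element tensor.
    return lambda path: metric(path).item()


# Usage (requires pyiqa and the inference outputs from the command above):
#   images = list_images("result/RealSet65")
#   for name in ("clipiqa", "musiq"):
#       print(name, average_score(images, make_pyiqa_scorer(name)))
```

Both metrics are no-reference, so only the restored images are needed. Differences in averaging, input resizing, or library version could shift the numbers, which is why the averaging step is kept explicit here.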
Issue and Inquiry:
- Despite varying the random seed with the --seed option, the scores did not align with the reported values.
- This discrepancy persists across different datasets and metrics, prompting me to question if a step was missed or executed incorrectly.
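
To put the gaps in perspective, the relative deviation of each measured score from its expected value can be computed as below. The numbers are the ones listed above; the function name `rel_gap_percent` and the dataset labels in the dictionary are only illustrative.

```python
def rel_gap_percent(measured, reported):
    """Relative deviation of `measured` from `reported`, in percent."""
    return 100.0 * (measured - reported) / reported


# (measured, expected) pairs copied from the results above.
results = {
    ("RealSet65", "CLIPIQA"): (0.6418642669916153, 0.6537),
    ("RealSet65", "MUSIQ"): (58.211212921142575, 61.330),
    ("RealSR subset", "CLIPIQA"): (0.5409876523911953, 0.5958),
    ("RealSR subset", "MUSIQ"): (53.28555391311645, 59.873),
}

for (dataset, metric), (measured, reported) in results.items():
    gap = rel_gap_percent(measured, reported)
    print(f"{dataset:14s} {metric:8s} {gap:+.2f}%")
```

All four gaps are negative, and the two on the second set are considerably larger than the two on RealSet65, so the shortfall is systematic and not the kind of scatter that changing the seed would produce.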

Questions:

Could there be an oversight in my testing methodology or a specific procedure I should follow?
Is evaluating CLIPIQA and MUSIQ on the Y channel necessary or recommended for accurate results?

I am keen to understand and rectify these discrepancies and would greatly appreciate your insights.

Thank you for your assistance.

zsyOAOA commented 4 months ago

The checkpoint released in this repo is an enhanced version that was trained for longer. It produces better visual results, but its CLIPIQA and MUSIQ scores are slightly lower. After the ECCV deadline, I will upload the checkpoint that reproduces the results in our paper.

Guaishou74851 commented 4 months ago

Hello @zsyOAOA,

I recently conducted tests using the ResShift model on the ImageNet-Test dataset you provided here. I am pleased to share that the results closely align with the reported values, reinforcing the model's reliability. Below are the specific metrics I obtained:

PSNR: Calculated on the Y channel of the YCbCr space using the calculate_psnr function in utils/util_image.py.
- My Result: 25.061538989403044
- Reported Result: 25.01
SSIM: Evaluated on the Y channel of the YCbCr space using the calculate_ssim function in utils/util_image.py.
- My Result: 0.6782901566840513
- Reported Result: 0.677
LPIPS: Evaluated in the RGB format using the IQA-PyTorch library.
- My Result: 0.2239309353015075
- Reported Result: 0.231
CLIPIQA: Evaluated in the RGB format using the IQA-PyTorch library.
- My Result: 0.6003170353670915
- Reported Result: 0.592
MUSIQ: Evaluated in the RGB format using the IQA-PyTorch library.
- My Result: 54.31855493927002
- Reported Result: 53.660
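
As an illustration of what "on the Y channel" means for the PSNR figure above, here is a minimal NumPy sketch. It is not the repo's `calculate_psnr`; it assumes RGB inputs scaled to [0, 1] and the ITU-R BT.601 luma conversion commonly used in super-resolution evaluation, and the names `rgb_to_y` and `psnr_y` are hypothetical.

```python
import numpy as np

# Assumption: BT.601 "studio swing" luma weights, mapping [0, 1] RGB to Y in [16, 235].
_Y_WEIGHTS = np.array([65.481, 128.553, 24.966])


def rgb_to_y(img):
    """Convert an (H, W, 3) RGB image in [0, 1] to its Y channel in [16, 235]."""
    img = np.asarray(img, dtype=np.float64)
    return 16.0 + img @ _Y_WEIGHTS


def psnr_y(img1, img2, border=0):
    """PSNR between the Y channels of two RGB images, optionally cropping a border."""
    y1, y2 = rgb_to_y(img1), rgb_to_y(img2)
    if border > 0:
        y1 = y1[border:-border, border:-border]
        y2 = y2[border:-border, border:-border]
    mse = np.mean((y1 - y2) ** 2)
    if mse == 0:
        return float("inf")
    # Peak value 255 matches the 8-bit range of the Y channel.
    return float(10.0 * np.log10(255.0 ** 2 / mse))
```

Full-reference metrics such as PSNR and SSIM are sensitive to this choice of color space, so a Y-versus-RGB mismatch alone can move them noticeably.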

The results for the ImageNet-Test dataset are satisfactory and align well with the reported figures, which is commendable.

Following your previous reply, I look forward to running the code with the paper checkpoint once it is uploaded, so that I can replicate the results documented in your paper. Your efforts toward transparency and reproducibility are much appreciated.

Best regards, Bin Chen
