Some problems reproducing the results - Githubissues

QuinlanD commented 3 years ago

Hello, how did you split the dataset for the ranking task? Could you share your code? I downloaded the LIVE data, split the original images into three parts, used z_task_shell/0...sh to generate the distorted images for train/val/test respectively, and renamed rank.txt to label.txt. After the run finished, the results look wrong. Here is the printed output:

2021-01-09 18:22:36,298 test.py:58 INFO Acc@1 0.000 and loss 0.000 with time 0.000
2021-01-09 18:22:39,311 test.py:58 INFO Acc@1 0.000 and loss 0.000 with time 0.000
2021-01-09 18:22:47,096 test.py:58 INFO Acc@1 0.000 and loss 0.000 with time 0.000
2021-01-09 18:22:48,216 test.py:58 INFO Acc@1 0.000 and loss 0.000 with time 0.000
2021-01-09 18:22:48,924 test.py:58 INFO Acc@1 0.000 and loss 0.000 with time 0.000
2021-01-09 18:22:52,433 test.py:58 INFO Acc@1 0.000 and loss 0.000 with time 0.000
2021-01-09 18:22:59,601 test.py:58 INFO Acc@1 0.000 and loss 0.000 with time 0.000
2021-01-09 18:22:59,603 main.py:204 INFO ===== lr decay rate: 0.001 -> 0.001 =====
2021-01-09 18:22:59,616 train.py:61 INFO Training Over with lr=1.0000000000000001e-07~~
2021-01-09 18:22:59,617 my_dataloader.py:128 INFO Using image size: [224, 224]
2021-01-09 18:23:00,854 test.py:58 INFO Acc@1 0.000 and loss 0.000 with time 0.000
2021-01-09 18:23:00,855 main.py:204 INFO ===== lr decay rate: 0.001 -> 0.001 =====
2021-01-09 18:23:00,873 train.py:61 INFO Training Over with lr=1.0000000000000001e-07~~
2021-01-09 18:23:00,874 my_dataloader.py:128 INFO Using image size: [224, 224]
2021-01-09 18:23:01,873 test.py:58 INFO Acc@1 0.000 and loss 0.000 with time 0.000
2021-01-09 18:23:01,874 main.py:204 INFO ===== lr decay rate: 0.001 -> 0.001 =====
2021-01-09 18:23:01,883 train.py:61 INFO Training Over with lr=1.0000000000000001e-07~~
2021-01-09 18:23:01,884 my_dataloader.py:128 INFO Using image size: [224, 224]
2021-01-09 18:23:05,027 test.py:58 INFO Acc@1 0.000 and loss 0.000 with time 0.000
2021-01-09 18:23:05,028 main.py:204 INFO ===== lr decay rate: 0.001 -> 0.001 =====
2021-01-09 18:23:05,121 train.py:61 INFO Training Over with lr=1.0000000000000001e-07~~
2021-01-09 18:23:05,122 my_dataloader.py:128 INFO Using image size: [224, 224]
2021-01-09 18:23:05,404 test.py:58 INFO Acc@1 0.000 and loss 0.000 with time 0.000
2021-01-09 18:23:05,404 main.py:176 INFO Evaluation: Acc@1 0.000 and loss 0.000.
2021-01-09 18:23:05,404 main.py:177 INFO Evaluation results:
2021-01-09 18:23:05,405 main.py:180 INFO Evaluation Over~
2021-01-09 18:23:07,007 test.py:58 INFO Acc@1 0.000 and loss 0.000 with time 0.000
2021-01-09 18:23:07,009 main.py:176 INFO Evaluation: Acc@1 0.000 and loss 0.000.
2021-01-09 18:23:07,009 main.py:177 INFO Evaluation results:
2021-01-09 18:23:07,009 main.py:180 INFO Evaluation Over~
2021-01-09 18:23:07,434 test.py:58 INFO Acc@1 0.000 and loss 0.000 with time 0.000
2021-01-09 18:23:07,435 main.py:176 INFO Evaluation: Acc@1 0.000 and loss 0.000.
2021-01-09 18:23:07,435 main.py:177 INFO Evaluation results:
2021-01-09 18:23:07,435 main.py:180 INFO Evaluation Over~
2021-01-09 18:23:08,996 test.py:58 INFO Acc@1 0.000 and loss 0.000 with time 0.000
2021-01-09 18:23:08,997 main.py:176 INFO Evaluation: Acc@1 0.000 and loss 0.000.
2021-01-09 18:23:08,997 main.py:177 INFO Evaluation results:
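For reference, a minimal sketch of one way to do the three-way split described above: split by reference image, so that every distorted version of a given reference lands in exactly one subset. The 60/20/20 ratios, the seed, and the file names are assumptions for illustration only; this is not the repository's own procedure.

```python
import random


def split_references(ref_names, ratios=(0.6, 0.2, 0.2), seed=0):
    """Split reference image names into disjoint train/val/test lists."""
    names = sorted(ref_names)           # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(names)  # seeded shuffle, independent of global state
    n_train = int(len(names) * ratios[0])
    n_val = int(len(names) * ratios[1])
    return {
        "train": names[:n_train],
        "val": names[n_train:n_train + n_val],
        "test": names[n_train + n_val:],  # whatever remains goes to test
    }


# Hypothetical file names; LIVE ships 29 reference images.
refs = [f"ref_{i:02d}.bmp" for i in range(29)]
splits = split_references(refs)
for name, items in splits.items():
    print(name, len(items))
```

The distortion scripts would then be run once per subset, on that subset's reference images only.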

TonyChenjf commented 3 years ago

Have you solved this?

QuinlanD commented 3 years ago

> Have you solved this?

No. Did you run into the same problem?

TonyChenjf commented 3 years ago

Yes. It may be a version issue.

fakeyee commented 3 years ago

I trained with the mobilenetv3_large network. (I can't get efficientnet-b0 to run; it keeps failing with RuntimeError: The size of tensor a (384) must match the size of tensor b (196) at non-singleton dimension 0.) The test and val folders use a different image set, train uses the LIVE dataset, and I generated a label.txt for each of them. Training produced a .pth file of about 17 MB; I converted it to JIT and then ran the validation script under demos. I don't understand what the output means:

a0bkfkutkf6_11_kid.jpg: [0.19183607]
a0c3e0mtkc3_11_man.jpg: [0.20544936]
a0ffursj21n_47.2_adult.jpg: [0.32216233]
a0bdb8qe4dr_50.5_low.jpg: [0.2767007]
a0ahjorwcpv_7_normal.jpg: [0.2854018]

Does anyone know? Can these be normalized into an interpretable score?
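On the normalization question, a minimal sketch, assuming the model was trained only with a ranking objective, in which case the raw outputs are probably meaningful only relative to each other. It min-max rescales a set of raw outputs to 0..100. The function name is hypothetical, and whether a higher raw value means better or worse quality depends on how the labels were ordered during training, which this sketch does not settle.

```python
def minmax_to_100(raw_scores):
    """Rescale raw model outputs to the range 0..100 (relative, not absolute)."""
    lo, hi = min(raw_scores), max(raw_scores)
    if hi == lo:
        # All scores identical: no ordering information, return the midpoint.
        return [50.0 for _ in raw_scores]
    return [(s - lo) / (hi - lo) * 100.0 for s in raw_scores]


# Raw outputs copied from the comment above.
raw = {
    "a0bkfkutkf6_11_kid.jpg": 0.19183607,
    "a0c3e0mtkc3_11_man.jpg": 0.20544936,
    "a0ffursj21n_47.2_adult.jpg": 0.32216233,
    "a0bdb8qe4dr_50.5_low.jpg": 0.2767007,
    "a0ahjorwcpv_7_normal.jpg": 0.2854018,
}
scaled = dict(zip(raw, minmax_to_100(list(raw.values()))))
for name, value in scaled.items():
    print(f"{name}: {value:.1f}")
```

A rescaling like this only ranks images within the set it was computed on. For scores comparable to human ratings, the lo and hi bounds would have to come from a fixed reference set, or the model would need fine-tuning on images with subjective scores.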

TonyChenjf commented 3 years ago

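I've been collecting evaluation metrics lately. I'm not sure what is causing your problem. Judging from the error, the input tensor may not match what the model expects, so it might be worth checking the preprocessing module. I trained Tencent's DVQA earlier and its scores also looked wrong and hard to interpret; I'm still tracking down the cause and checking against the original paper. If we make progress in the next few days I'll get in touch so we can compare notes.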
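A minimal sketch of the kind of preprocessing check suggested above: verify the shape of each batch before it reaches the model. The helper name is hypothetical. The expected 224 x 224 size comes from the "Using image size: [224, 224]" log line earlier in this thread, and the (N, C, H, W) layout is the usual PyTorch convention, assumed here rather than confirmed from the repository.

```python
def check_batch_shape(shape, expected_hw=(224, 224), channels=3):
    """Return a list of problems with an (N, C, H, W) batch shape; empty if OK."""
    if len(shape) != 4:
        return [f"expected 4 dims (N, C, H, W), got {len(shape)}: {tuple(shape)}"]
    problems = []
    _, c, h, w = shape
    if c != channels:
        problems.append(f"expected {channels} channels, got {c}")
    if (h, w) != tuple(expected_hw):
        problems.append(f"expected H x W = {tuple(expected_hw)}, got {(h, w)}")
    return problems


print(check_batch_shape((8, 3, 224, 224)))  # a batch that matches the logged size
print(check_batch_shape((8, 3, 256, 384)))  # a batch with a mismatched spatial size
```

Inside a training loop this could be called as `check_batch_shape(tuple(images.shape))` on each batch from the data loader, which narrows down whether the mismatch starts in preprocessing or inside the network.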

fakeyee commented 3 years ago

OK, sounds good. About the dependency versions you mentioned earlier, these are what I'm using, on Python 3.8.5:

torch==1.7.1
torchvision==0.8.2
tensorboard==2.3.0
prefetch-generator==1.0.1
Glymur==0.9.3
imgaug==0.4.0
matplotlib==3.3.2
numpy==1.19.2
opencv-python==4.5.1.48
openJPEG==2.3.0
pandas==1.2.0
Pillow==8.1.0
scikit-image==0.18.1
scikit-learn==0.24.0
scipy==1.6.0
seaborn==0.11.1

Jack-old8 commented 1 year ago

I can't get it to run locally; the apex package keeps failing to install. What should I do?
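One possible workaround, as a minimal sketch: NVIDIA's apex is normally built from its GitHub source rather than installed from PyPI, and PyTorch 1.6 and later ship native mixed precision as torch.cuda.amp. The snippet below makes the import optional so a script can still start without apex. Whether this repository's training code can actually run without apex is an assumption that has not been checked here, and `HAS_APEX` and `describe_precision_backend` are hypothetical names.

```python
try:
    # NVIDIA apex mixed-precision module; only present if apex was installed.
    from apex import amp
    HAS_APEX = True
except ImportError:
    amp = None
    HAS_APEX = False


def describe_precision_backend():
    """Report which mixed-precision path a training script could take."""
    if HAS_APEX:
        return "apex.amp"
    return "apex not available: use full precision or torch.cuda.amp"


print(describe_precision_backend())
```

Any call sites that use `amp` would then need to be guarded by `HAS_APEX`, or ported to torch.cuda.amp.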

zheng-yuwei / RankIQA.PyTorch

Some problems reproducing the results #4