zyl1336110861 commented 5 years ago

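My training setup: batch size 12, 2080 Ti GPUs, 16 epochs. All of my results come out slightly higher than the paper's. The paper's numbers, with my test results in parentheses: >1px = 8.03 (8.42), >2px = 4.47 (4.75), >3px = 3.30 (3.56), EPE = 0.765 (0.864). Why might this be? I would appreciate your advice. Thank you!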

zyl1336110861 commented 5 years ago

@xy-guo

xy-guo commented 5 years ago

Did you change any of the code? Did you test with the original code? And what do you mean by your test data?

xy-guo commented 5 years ago

Is this the same problem as the one in #7?

zyl1336110861 commented 5 years ago

I did not change the code; I only changed the paths in the sh file to match my own files. I ran sceneflow.sh, as follows:

#!/usr/bin/env bash
set -x
DATAPATH="/data/Yongli_data/"
CUDA_VISIBLE_DEVICES=4,5,6,7,8,9 python main.py --dataset sceneflow \
    --datapath $DATAPATH --trainlist ./filenames/sceneflow_train.txt --testlist ./filenames/sceneflow_test.txt \
    --epochs 16 --lrepochs "10,12,14,16:2" \
    --model gwcnet-gc --logdir ./checkpoints_second/sceneflow/gwcnet-gc --resume

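I tested with the original code you provided, on the SceneFlow dataset, organized following the structure described at https://github.com/feihuzhang/GANet. I ran the original code for 16 epochs; each epoch consists of one training pass and one testing pass (which I suppose counts as validation?). The test results I mentioned are what the code printed after all 16 epochs had finished. I don't know what went wrong; the test numbers are too high, so I would like to ask for your advice.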

zyl1336110861 commented 5 years ago

It looks like they were testing on KITTI, whereas I am testing on SceneFlow first, so it is not quite the same.

xy-guo commented 5 years ago

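First, to get the same results as in my paper, I recommend testing with my test code. My test excludes some images with overly large disparity and averages only over the valid range. You could also try training with 4 GPUs: BN is not synchronized by default, and a per-GPU batch that is too small may also hurt the results. I don't know how large that effect is.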
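The reply above says the author's test averages only over a valid disparity range. A minimal sketch of that idea in plain Python follows. It assumes (this is not stated in the thread) that a pixel counts as valid when its ground-truth disparity lies strictly between 0 and `maxdisp`; `maxdisp=192` is the default in the argument parser quoted later in this thread, and `masked_metrics` is a hypothetical helper name, not a function from the repository.

```python
def masked_metrics(pred, gt, maxdisp=192):
    """EPE and >1px / >2px / >3px error rates over valid pixels only.

    pred and gt are flat sequences of per-pixel disparities.
    Assumption: a pixel is valid when 0 < gt < maxdisp.
    """
    errors = [abs(p - g) for p, g in zip(pred, gt) if 0 < g < maxdisp]
    if not errors:
        return None  # no valid pixel, so nothing to average
    n = len(errors)
    return {
        "EPE": sum(errors) / n,
        "Thres1": sum(e > 1 for e in errors) / n,
        "Thres2": sum(e > 2 for e in errors) / n,
        "Thres3": sum(e > 3 for e in errors) / n,
    }


# The last pixel has a ground truth of 250 >= maxdisp, so it is ignored.
metrics = masked_metrics([10, 22, 50, 300], [10.5, 20, 50, 250])
print(metrics)
```

Averaging over every pixel instead would let the single out-of-range pixel (error 50) dominate the EPE, which is one way an evaluation script that differs from the author's can report larger numbers.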

zyl1336110861 commented 5 years ago

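Hello, I am using the original code you released, so does my testing already include the dataset handling you described? I just run sh sceneflow.sh directly, with no changes to the original code, is that right?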

zyl1336110861 commented 5 years ago

By the way, I run a batch size of 2 per GPU, on 6 GPUs in total.

xy-guo commented 5 years ago

No need to change the code. You could try 4 GPUs first, or see whether you can use SyncBN.

zyl1336110861 commented 5 years ago

OK, thank you!

zyl1336110861 commented 5 years ago

Oh, sorry to bother you again. I set the batch size to 12 from the start; 2 samples per GPU is already the limit, and a single GPU cannot fit 3.

xy-guo commented 5 years ago

Is that 8 GB of GPU memory?

zyl1336110861 commented 5 years ago

Unfortunately not; our GPUs are 2080 Ti cards with 11 GB of memory.

xy-guo commented 5 years ago

Sorry, I seem to have misremembered. At the time, I got my results with batch 16 on eight GPUs.
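The per-GPU batch size matters here because unsynchronized BN computes its statistics on each GPU's slice of the batch. A small sketch of the arithmetic, assuming the batch is split evenly across GPUs (as a data-parallel wrapper such as `torch.nn.DataParallel` does; whether the repository uses exactly that wrapper is an assumption). `per_gpu_batch` is a hypothetical helper.

```python
def per_gpu_batch(total_batch_size, num_gpus):
    """Samples each GPU sees per step when the batch is split evenly."""
    if total_batch_size % num_gpus != 0:
        raise ValueError("batch size is not evenly divisible by the GPU count")
    return total_batch_size // num_gpus


# Configurations mentioned in the thread: (total batch size, number of GPUs).
configs = {"reporter": (12, 6), "author": (16, 8)}
per_gpu = {name: per_gpu_batch(b, g) for name, (b, g) in configs.items()}
print(per_gpu)
```

Both configurations work out to 2 samples per GPU, so under this assumption the BN statistics are computed over the same number of samples in either setup.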

zyl1336110861 commented 5 years ago

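@xy-guo, I tried the fix from the other user you pointed me to, namely changing align_corners to False in the interpolation options, and I also changed the batch size to 8 on four GPUs. Neither seems to have improved anything, and I don't know why. The results from pretraining on SceneFlow are always about 0.4 percentage points higher.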
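The align_corners option mentioned above selects between two ways of mapping an output pixel index back to a source coordinate during resampling. The sketch below writes out the two mappings for one dimension, as they are usually described for PyTorch's linear interpolation modes. `source_coord` is a hypothetical helper, and the thread does not spell out which call was changed, so treat this as an illustration of why the setting can shift results rather than as the repository's code.

```python
def source_coord(i, in_size, out_size, align_corners):
    """Source coordinate sampled for output index i (one dimension)."""
    if align_corners:
        # Corner pixels of input and output are aligned exactly.
        if out_size == 1:
            return 0.0
        return i * (in_size - 1) / (out_size - 1)
    # Pixels are treated as unit cells; cell centers are aligned.
    # Negative coordinates are clamped to 0 for linear modes.
    return max(0.0, (i + 0.5) * in_size / out_size - 0.5)


# Upsampling a row of 4 pixels to 8 pixels under both conventions.
for flag in (True, False):
    coords = [round(source_coord(i, 4, 8, flag), 3) for i in range(8)]
    print(flag, coords)
```

Because the two conventions sample slightly different positions, a network trained under one and evaluated under the other can produce systematically shifted outputs.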

xy-guo commented 5 years ago

Can you reproduce the paper's results on KITTI? Also, could you list the detailed parameters you used with batch_size 12, including how you changed batch_size and test_batch_size?

zyl1336110861 commented 5 years ago

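I have not done that step yet. I plan to run validation on KITTI with both the weights you provided and the weights I trained, to see whether I can reproduce the results. Thank you! P.S. I am currently running batch size 16 on 8 GPUs; 3 epochs in so far, the results have not improved.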

zyl1336110861 commented 5 years ago

This is the configuration in the sh file:

#!/usr/bin/env bash
set -x
DATAPATH="/data/Yongli_data/"
CUDA_VISIBLE_DEVICES=4,5,6,7,8,9 python main.py --dataset sceneflow \
    --datapath $DATAPATH --trainlist ./filenames/sceneflow_train.txt --testlist ./filenames/sceneflow_test.txt \
    --epochs 16 --lrepochs "10,12,14,16:2" \
    --model gwcnet-gc --logdir ./checkpoints_sixth/sceneflow/gwcnet-gc #--resume

In the main function:

parser = argparse.ArgumentParser(description='Group-wise Correlation Stereo Network (GwcNet)')
parser.add_argument('--model', default='gwcnet-g', help='select a model structure', choices=models.keys())
parser.add_argument('--maxdisp', type=int, default=192, help='maximum disparity')

parser.add_argument('--dataset', required=True, help='dataset name', choices=datasets.keys())
parser.add_argument('--datapath', required=True, help='data path')
parser.add_argument('--trainlist', required=True, help='training list')
parser.add_argument('--testlist', required=True, help='testing list')

parser.add_argument('--lr', type=float, default=0.001, help='base learning rate')
parser.add_argument('--batch_size', type=int, default=12, help='training batch size')
parser.add_argument('--test_batch_size', type=int, default=8, help='testing batch size')
parser.add_argument('--epochs', type=int, required=True, help='number of epochs to train')
parser.add_argument('--lrepochs', type=str, required=True, help='the epochs to decay lr: the downscale rate')

parser.add_argument('--logdir', required=True, help='the directory to save logs and checkpoints')
parser.add_argument('--loadckpt', help='load the weights from a specific checkpoint')
parser.add_argument('--resume', action='store_true', help='continue training the model')
parser.add_argument('--seed', type=int, default=1, metavar='S', help='random seed (default: 1)')

parser.add_argument('--summary_freq', type=int, default=20, help='the frequency of saving summary')
parser.add_argument('--save_freq', type=int, default=1, help='the frequency of saving checkpoint')
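The `--lrepochs` help text describes the value as "the epochs to decay lr: the downscale rate", so `"10,12,14,16:2"` reads as "divide the learning rate by 2 at epochs 10, 12, 14 and 16". A minimal sketch of that reading follows. `parse_lrepochs` and `lr_at_epoch` are hypothetical helpers, and whether the repository counts epochs from 0 or 1, or applies the decay at the start or end of an epoch, is an assumption here.

```python
def parse_lrepochs(spec):
    """Split 'e1,e2,...:rate' into a list of epochs and a downscale rate."""
    epochs_part, rate_part = spec.split(":")
    milestones = [int(e) for e in epochs_part.split(",")]
    return milestones, float(rate_part)


def lr_at_epoch(base_lr, epoch, milestones, rate):
    """Learning rate after applying one decay per milestone already reached.

    Assumption: the decay takes effect once `epoch >= milestone`.
    """
    decays = sum(1 for m in milestones if epoch >= m)
    return base_lr / (rate ** decays)


milestones, rate = parse_lrepochs("10,12,14,16:2")
for epoch in (0, 10, 12, 14):
    print(epoch, lr_at_epoch(0.001, epoch, milestones, rate))
```

Here 0.001 is the `--lr` default from the parser above.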

The results are as follows:

Epoch 15/16, Iter 543/547, test loss = 0.100, time = 0.605390
Epoch 15/16, Iter 544/547, test loss = 0.105, time = 0.593975
Epoch 15/16, Iter 545/547, test loss = 0.208, time = 0.628453
Epoch 15/16, Iter 546/547, test loss = 0.245, time = 0.314822
avg_test_scalars {'loss': 0.3329115998837586, 'D1': [0.029726512310430542], 'EPE': [0.86220250307315], 'Thres1': [0.08460966610031329], 'Thres2': [0.04767845512131123], 'Thres3': [0.035727261786963754]}
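To compare this log with the paper's figures quoted at the top of the thread, the threshold metrics need to be converted from fractions to percentages; EPE is already in pixels. A short sketch of that comparison, with the logged values copied from the output above and the paper values from the first comment (`to_paper_units` and `gaps` are hypothetical helpers):

```python
# Paper figures as quoted in the first comment (percent, except EPE in px).
paper = {"Thres1": 8.03, "Thres2": 4.47, "Thres3": 3.30, "EPE": 0.765}

# Values from the avg_test_scalars line above (fractions, except EPE in px).
logged = {
    "Thres1": 0.08460966610031329,
    "Thres2": 0.04767845512131123,
    "Thres3": 0.035727261786963754,
    "EPE": 0.86220250307315,
}


def to_paper_units(name, value):
    """Convert error-rate fractions to percent; leave EPE untouched."""
    return value if name == "EPE" else value * 100.0


def gaps(logged, paper):
    """Reproduced value minus paper value, in the paper's units."""
    return {k: to_paper_units(k, logged[k]) - paper[k] for k in paper}


for name, gap in gaps(logged, paper).items():
    print(f"{name}: +{gap:.3f}")
```

The >1px gap comes out at roughly 0.43 percentage points, in line with the "about 0.4 percentage points" mentioned earlier in the thread.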

xy-guo commented 5 years ago

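I will try running batch 12 on 6 GPUs on my side. In the meantime, you could first confirm whether the KITTI checkpoint gives the correct results, and check the number of training samples and whether all the files exist.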

zyl1336110861 commented 5 years ago

OK, I will check again. Thank you!

zyl1336110861 commented 5 years ago

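Hello! My training results on SceneFlow now match the ones in your paper! The cause was indeed the fix from the user you pointed me to (#7): changing None to False did it. Training with 6 GPUs gives the same results too. Thank you very much! A few things still puzzle me, though:
1) How was your best.ckpt obtained by fine-tuning? When I take the model trained on SceneFlow and train it on KITTI 2012, the results are again slightly higher (D1 is 0.5 percentage points higher). When training on KITTI 2015, some intermediate epochs reach results consistent with the paper. Was best.ckpt picked from an intermediate epoch whose results happened to fit relatively well?
2) What batch size and number of epochs did you use? Still 16 and 300?
3) For fine-tuning, is the same pretrained model fine-tuned on 2012 and then on 2015 in sequence, or on just one of them?
Sorry to bother you, but I really am a bit confused.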

xy-guo commented 5 years ago

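1) The paper explains a bit about how fine-tuning was done. Because the val set is very small, the results fluctuate a lot, so for submission to the KITTI online test I picked the checkpoint that did best on val.
2) Batch size 16, 300 epochs.
3) Fine-tune separately for each dataset.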

zyl1336110861 commented 5 years ago

Thank you! I really appreciate it; your answers helped me a lot!

Sarah20187 commented 5 years ago

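@xy-guo As I recall, the model PSMNet submitted was trained on the entire KITTI training set for 1000 epochs. Did the model you submitted use only part of the training data, with a small portion held out as a validation set?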

xy-guo commented 5 years ago

Yes. You can find all the validation sets used in my experiments in the filenames folder.

pinkteae commented 3 years ago

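Hello, sorry to bother you. I trained on SceneFlow with your code and then fine-tuned on KITTI 2015, but when validation runs after the first training epoch, val iteration 14/20 always fails with: the shape of the mask [1,383,1244] at index 1 does not match the shape of the indexed tensor [1,384,1248] at index 1. Could you tell me what causes this?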
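This question is not answered in the thread. The error message itself says the mask is 383×1244 while the tensor it indexes is 384×1248, so one thing worth checking is whether both were padded or cropped to the same size. The sketch below only does that bookkeeping; `shape_gap` and `crop_slices` are hypothetical helpers, and the choice of padding on the top and right edges is an assumption for illustration, not something confirmed for this repository.

```python
def shape_gap(mask_shape, tensor_shape):
    """Per-dimension size difference (tensor minus mask)."""
    return tuple(t - m for m, t in zip(mask_shape, tensor_shape))


def crop_slices(mask_hw, tensor_hw):
    """Slices that cut the larger tensor down to the mask's height and width.

    Assumption: the extra rows were added at the top and the extra
    columns at the right, so those are the ones removed.
    """
    (mh, mw), (th, tw) = mask_hw, tensor_hw
    if mh > th or mw > tw:
        raise ValueError("mask is larger than the tensor")
    return slice(th - mh, th), slice(0, mw)


print(shape_gap((1, 383, 1244), (1, 384, 1248)))
rows, cols = crop_slices((383, 1244), (384, 1248))
print(rows, cols)
```

If the two shapes differ for only some validation samples, that points at images whose original size differs from the rest of the set.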

xy-guo / GwcNet

Hello Guo, a question about training results on the SceneFlow dataset #9
