Code trains poorly on the NvGesture dataset

Lilyma7019 commented 3 months ago

Hello, and thank you for your excellent work. I trained the NvGesture RGB modality from scratch with your code, but the accuracy stays around 82 and never reaches the 89 reported in the paper. The dataset is the one downloaded from the Baidu Netdisk link you provided, and the GPU is a V100-32GB. The training command is: CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --nproc_per_node=1 --master_port=1234 --use_env train.py --config config/NvGesture.yml --data ./my_dataset/ --splits ./my_dataset/dataset_splits/rgb/ --save ./output_dir/NV-TSM-M --sample-duration 32 --opt sgd --lr 0.01 --sched cosine --smprob 0.0 --mixup 0.001 --shufflemix 0.2 --epochs 100 --distill 0.0 --type M --intar-fatcer 2 --finetune ./Checkpoints/NTU-RGBD-32-DTNV2-M-TSM.pth.tar --batch-size 4

Lilyma7019 commented 3 months ago

log20240712-174806.txt The training log is attached above.

zhoubenjia commented 3 months ago

Hi, please try setting --sample-duration 64.

Lilyma7019 commented 3 months ago

Hello, after setting --sample-duration to 64 the result is still unsatisfactory, at around 84. I have uploaded the training log. The training command is: CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --nproc_per_node=1 --master_port=1234 --use_env train.py --config config/NvGesture.yml --data ./my_dataset/ --splits ./my_dataset/dataset_splits/rgb/ --save ./output_dir/NV-TSM-M --sample-duration 64 --opt sgd --lr 0.01 --sched cosine --smprob 0.0 --mixup 0.001 --shufflemix 0.2 --epochs 100 --distill 0.0 --type M --intar-fatcer 2 --finetune ./Checkpoints/NTU-RGBD-32-DTNV2-M-TSM.pth.tar --batch-size 4 log20240713-114915.txt

Lilyma7019 commented 3 months ago

Hello, I trained the NvGesture RGB modality from scratch with your code. The first command was: CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --nproc_per_node=1 --master_port=1234 --use_env train.py --config config/NvGesture.yml --data ./my_dataset/ --splits ./my_dataset/dataset_splits/rgb/ --save ./output_dir/NV-TSM-M --sample-duration 32 --opt sgd --lr 0.01 --sched cosine --smprob 0.0 --mixup 0.001 --shufflemix 0.2 --epochs 100 --distill 0.0 --type M --intar-fatcer 2 --finetune ./Checkpoints/NTU-RGBD-32-DTNV2-M-TSM.pth.tar --batch-size 4 The final result was 82.549.

The second command was: CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --nproc_per_node=1 --master_port=1234 --use_env train.py --config config/NvGesture.yml --data ./my_dataset/ --splits ./my_dataset/dataset_splits/rgb/ --save ./output_dir/NV-TSM-M --sample-duration 64 --opt sgd --lr 0.01 --sched cosine --smprob 0.0 --mixup 0.001 --shufflemix 0.2 --epochs 100 --distill 0.0 --type M --intar-fatcer 2 --finetune ./Checkpoints/NTU-RGBD-32-DTNV2-M-TSM.pth.tar --batch-size 4 The final result was 84.1177.

The third command was: CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --nproc_per_node=1 --master_port=1234 --use_env train.py --config config/NvGesture.yml --data ./my_dataset/ --splits ./my_dataset/dataset_splits/rgb/ --save ./output_dir/NV-TSM-M --sample-duration 32 --opt sgd --lr 0.01 --sched cosine --smprob 0.0 --mixup 0.001 --shufflemix 0.2 --epochs 100 --distill 0.0 --type M --intar-fatcer 2 --finetune ./Checkpoints/NTU-RGBD-32-DTNV2-M-TSM.pth.tar --batch-size 8 The final result was 84.9206.

The fourth command was: CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 --master_port=1234 --use_env train.py --config config/NvGesture.yml --data ./my_dataset/ --splits ./my_dataset/dataset_splits/rgb/ --save ./output_dir/NV-TSM-M --sample-duration 64 --opt sgd --lr 0.01 --sched cosine --smprob 0.0 --mixup 0.001 --shufflemix 0.2 --epochs 100 --distill 0.0 --type M --intar-fatcer 2 --finetune ./Checkpoints/NTU-RGBD-32-DTNV2-M-TSM.pth.tar --batch-size 8 The final result was 82.9365.

The first three runs were trained on a single V100-32GB, and the fourth on two A40-48GB GPUs. The dataset was downloaded from the link you provided in "https://github.com/damo-cv/MotionRGBD/issues/6". However, whether I use 32 or 64 frames, I cannot reach the results in your paper. Is there something wrong with how I am training?
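When comparing the four runs, it may help to line up the effective batch size of each one next to its accuracy. The sketch below assumes that --batch-size is counted per process (per GPU) under torch.distributed.launch, which this thread does not confirm. The `runs` list only restates the numbers reported above, and `effective_batch_size` is a hypothetical helper, not part of the repository.

```python
# Sketch: effective batch size of the four reported runs.
# ASSUMPTION: --batch-size is per process (per GPU); not confirmed in the thread.
runs = [
    {"run": 1, "nproc": 1, "batch_size": 4, "frames": 32, "acc": 82.549},
    {"run": 2, "nproc": 1, "batch_size": 4, "frames": 64, "acc": 84.1177},
    {"run": 3, "nproc": 1, "batch_size": 8, "frames": 32, "acc": 84.9206},
    {"run": 4, "nproc": 2, "batch_size": 8, "frames": 64, "acc": 82.9365},
]


def effective_batch_size(nproc: int, batch_size: int) -> int:
    """Samples per optimizer step if every process loads `batch_size` clips."""
    return nproc * batch_size


for r in runs:
    r["effective"] = effective_batch_size(r["nproc"], r["batch_size"])
    print(
        f"run {r['run']}: frames={r['frames']} "
        f"effective_batch={r['effective']} acc={r['acc']}"
    )
```

Under that assumption the fourth run uses an effective batch of 16 with the same --lr 0.01 as the others, so it is not a like-for-like comparison with the second run.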

zhoubenjia commented 3 months ago

Hi, try this:
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 --master_port=1234 --use_env train.py --config config/NvGesture.yml \
--data /mnt/Data/datasets/Motion/nvgesture/ \
--splits /mnt/Data/datasets/Motion/nvgesture/dataset_splits/rgb/ \
--save ./output_dir/NV-64-M-batch_size=8 \
--batch-size 4 --sample-duration 64 --opt sgd --lr 0.01 --sched cosine --smprob 0.2 --mixup 0.001 --shufflemix 0.2 --epochs 100 --distill 0.0 --type M --intar-fatcer 2 --finetune ./output_dir/NTU-RGBD-32-DTNV2-M-TSM/model_best.pth.tar
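The commands in this thread are long enough that differences are easy to miss. As a rough sketch, the arguments after train.py can be parsed into a dictionary and compared. The helpers `script_flags` and `diff_flags` are hypothetical names introduced for illustration. The two strings are the fourth run reported earlier and the command suggested here, with the launcher prefix dropped.

```python
import shlex

# Flags whose values are local paths; they differ between machines by design.
PATH_FLAGS = {"--data", "--splits", "--save", "--finetune"}


def script_flags(command: str) -> dict:
    """Map each --flag after train.py to its value ('' if it has none)."""
    tokens = shlex.split(command)
    tokens = tokens[tokens.index("train.py") + 1:]
    flags, i = {}, 0
    while i < len(tokens):
        if tokens[i].startswith("--"):
            has_value = i + 1 < len(tokens) and not tokens[i + 1].startswith("--")
            flags[tokens[i]] = tokens[i + 1] if has_value else ""
            i += 2 if has_value else 1
        else:
            i += 1
    return flags


def diff_flags(a: dict, b: dict, ignore=()) -> dict:
    """Return {flag: (value_in_a, value_in_b)} for flags that differ."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys
            if k not in ignore and a.get(k) != b.get(k)}


my_run = (
    "train.py --config config/NvGesture.yml --data ./my_dataset/ "
    "--splits ./my_dataset/dataset_splits/rgb/ --save ./output_dir/NV-TSM-M "
    "--sample-duration 64 --opt sgd --lr 0.01 --sched cosine --smprob 0.0 "
    "--mixup 0.001 --shufflemix 0.2 --epochs 100 --distill 0.0 --type M "
    "--intar-fatcer 2 --finetune ./Checkpoints/NTU-RGBD-32-DTNV2-M-TSM.pth.tar "
    "--batch-size 8"
)
suggested = (
    "train.py --config config/NvGesture.yml "
    "--data /mnt/Data/datasets/Motion/nvgesture/ "
    "--splits /mnt/Data/datasets/Motion/nvgesture/dataset_splits/rgb/ "
    "--save ./output_dir/NV-64-M-batch_size=8 --batch-size 4 "
    "--sample-duration 64 --opt sgd --lr 0.01 --sched cosine --smprob 0.2 "
    "--mixup 0.001 --shufflemix 0.2 --epochs 100 --distill 0.0 --type M "
    "--intar-fatcer 2 "
    "--finetune ./output_dir/NTU-RGBD-32-DTNV2-M-TSM/model_best.pth.tar"
)

differences = diff_flags(script_flags(my_run), script_flags(suggested),
                         ignore=PATH_FLAGS)
for flag, (mine, theirs) in differences.items():
    print(f"{flag}: {mine} -> {theirs}")
```

Apart from paths, the only flags that differ are --batch-size and --smprob, so those two are the settings worth checking first when reproducing the suggested run.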

Lilyma7019 commented 3 months ago

OK, I used the command you gave: CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 --master_port=1234 --use_env train.py --config ./config/NvGesture.yml --data ./my_datasets/ --splits ./data/dataset_splits/NvGesture/rgb/ --save ./output_dir/NV-64-M-batch_size=8 --batch-size 4 --sample-duration 64 --opt sgd --lr 0.01 --sched cosine --smprob 0.2 --mixup 0.001 --shufflemix 0.2 --epochs 100 --distill 0.0 --type M --intar-fatcer 2 --finetune ./Checkpoints/NTU-RGBD-32-DTNV2-M-TSM/model_best.pth.tar. The result is still around 82, this time on two V100-32GB GPUs. Since the rest of the configuration is identical to yours, I wonder whether the random seed is the cause. Did you use the default seed of 123? If so, I really cannot understand why I am unable to reproduce the numbers in your paper.
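To rule the seed in or out, one option is to pin every random number generator explicitly before training. The sketch below is a generic PyTorch-style recipe, not the repository's own seeding code. `seed_everything` is a hypothetical helper, and 123 is used only because it is the default mentioned above. NumPy and PyTorch are seeded only if they are installed.

```python
import os
import random


def seed_everything(seed: int = 123) -> None:
    """Seed Python, and NumPy / PyTorch when available (generic recipe)."""
    random.seed(seed)
    # Affects hash randomization only in Python processes started afterwards.
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # safe to call without CUDA
        # Trade speed for repeatable cuDNN kernel selection.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass


seed_everything(123)
first = random.random()
seed_everything(123)
second = random.random()
print(first == second)
```

Even with identical seeds, results are generally not guaranteed to match exactly across different GPU models or library versions, so a fixed seed narrows the gap between runs rather than closing it.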

zhoubenjia commented 3 months ago

I think the problem is --finetune: when --sample-duration 64 is set, you should load the NTU-RGBD-64-DTNV2-M-TSM/model_best.pth.tar pretrained model.
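A small guard can catch this kind of mismatch before a 100-epoch run starts. The sketch below assumes the checkpoint names follow the pattern seen in this thread, NTU-RGBD-<frames>-DTNV2-M-TSM, and reads the frame count from the path. Both helpers are hypothetical and are not part of train.py.

```python
import re
from typing import Optional


def frames_in_checkpoint_name(path: str) -> Optional[int]:
    """Extract the frame count from names like NTU-RGBD-32-DTNV2-M-TSM.

    ASSUMPTION: the number after 'NTU-RGBD-' is the pretraining clip length.
    Returns None when the path does not follow that pattern.
    """
    match = re.search(r"NTU-RGBD-(\d+)-", path)
    return int(match.group(1)) if match else None


def finetune_matches(path: str, sample_duration: int) -> Optional[bool]:
    """True/False if the name can be checked, None if it cannot."""
    frames = frames_in_checkpoint_name(path)
    return None if frames is None else frames == sample_duration


ckpt = "./Checkpoints/NTU-RGBD-32-DTNV2-M-TSM.pth.tar"
if finetune_matches(ckpt, sample_duration=64) is False:
    print(f"warning: {ckpt} looks like a "
          f"{frames_in_checkpoint_name(ckpt)}-frame checkpoint, "
          "but --sample-duration is 64")
```

The check depends only on the file name, so it cannot detect a checkpoint that was renamed or saved under a different convention.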

Lilyma7019 commented 3 months ago

The Google Drive folder you shared does not contain the NTU-RGBD-64-DTNV2-M-TSM model weights. Would you be willing to share them with me?

zhoubenjia commented 3 months ago

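Hi, I will share the 64-frame weights once I find them. In the meantime, I have the log file from my most recent 32-frame run; could you compare it with yours and see where they differ? I trained on an RTX 4090. [log20240202-040244.txt](https://github.com/user-attachments/files/16263244/log20240202-040244.txt)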

zhoubenjia / MotionRGBD-PAMI

Code trains poorly on the NvGesture dataset #11