zhoubenjia / GFSLT-VLP

MIT License

Some problems during the training #14

Open mpuu00001 opened 3 months ago

mpuu00001 commented 3 months ago

Hi Authors,

Thank you for sharing your excellent work.

We encountered the following error when trying to train the visual encoder and text decoder with train_vlp_v2.py:

[screenshot: Screenshot 2024-06-04 at 18 07 39]

The root of the error appears to be in the dataloader used by the train_one_epoch function; the training process stops before entering the for loop. [screenshot attached]

We downloaded the raw dataset from https://www-i6.informatik.rwth-aachen.de/~koller/RWTH-PHOENIX-2014-T/

Could you please provide suggestions for resolving this error?

mpuu00001 commented 3 months ago

Dear Authors,

I realise that if I set args.num_workers = 0, training with train_vlp_v2.py runs without issue. If I increase args.num_workers to a value greater than 0, I get the error "_pickle.UnpicklingError: state is not a dictionary" from the dataloader. Setting num_workers to 0, however, slows training considerably.

Is it because of the use of OpenCV?
Could you please provide some help?
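For context, this kind of error typically appears when the dataset object holds state that cannot be pickled into DataLoader worker processes; open OpenCV handles (e.g. cv2.VideoCapture) are a common culprit. Below is a minimal, hypothetical sketch of the failure mode and the usual fix, using a plain file handle as a stand-in for a non-picklable OpenCV object; the class names are illustrative, not from this repository:

```python
import pickle


class BadDataset:
    """Eagerly opens a resource in __init__. The open handle is stored on
    the instance, so pickling the dataset (as DataLoader workers may do
    under the "spawn" start method) fails."""

    def __init__(self, path):
        self.handle = open(path)  # non-picklable state


class GoodDataset:
    """Stores only the path and opens the resource lazily, so the dataset
    itself stays picklable; each worker opens its own handle on first use."""

    def __init__(self, path):
        self.path = path
        self._handle = None

    def _get_handle(self):
        if self._handle is None:
            self._handle = open(self.path)
        return self._handle


def is_picklable(obj):
    """Return True if obj survives a pickle round-trip attempt."""
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False
```

Applying the same lazy-open pattern to any cv2 objects held by the dataset is one plausible way to make num_workers > 0 work.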

zhoubenjia commented 2 months ago

Hi, please provide your running command; it seems to be a problem with multi-process training.

mpuu00001 commented 2 months ago

Dear Authors,

Thanks for your reply and your great work! I used the following commands for the training:

torchrun --nproc_per_node=1 --master_port=1236 --master_addr=localhost train_vlp_v2.py --batch-size 4 --epochs 80 --opt sgd --lr 0.01 --output_dir out/vlp_v2 --training-refurbish True --noise-rate 0.15 --noise-type omit_last --random-shuffle False

torchrun --nproc_per_node=1 --master_port=1236 --master_addr=localhost train_vlp.py --batch-size 6 --epochs 80 --opt sgd --lr 0.01 --output_dir out/vlp

With these commands, I can run train_vlp_v2.py and train_vlp.py without a problem when args.num_workers = 0. However, when I set num_workers to a value greater than zero, the following error occurs with either of the commands above:

[screenshot: Screenshot 2024-06-20 at 14 55 02]

The error "_pickle.UnpicklingError: state is not a dictionary" comes from the dataloader. Setting args.num_workers = 0, however, slows training.

Could you please provide some help?
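One detail worth checking here is the multiprocessing start method: with num_workers > 0, PyTorch's DataLoader launches worker processes, and under the "spawn" start method the dataset object is pickled into each worker, which is exactly where unpickling errors like "state is not a dictionary" surface. A small stdlib-only check (the helper name is hypothetical):

```python
import multiprocessing as mp


def dataloader_start_method():
    """Report the multiprocessing start method that DataLoader workers
    would inherit. "fork" (the Linux default) shares memory with the
    parent process, while "spawn" pickles the dataset into each worker."""
    return mp.get_start_method()
```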

zhoubenjia commented 2 months ago

Hi, unfortunately the error you encountered seems to be related to your environment configuration. I don't see the same error when I run your command with num_workers=8:

torchrun --nproc_per_node=1 --master_port=1236 --master_addr=localhost train_vlp_v2.py --batch-size 4 --epochs 80 --opt sgd --lr 0.01 --output_dir out/vlp_v2 --training-refurbish True --noise-rate 0.15 --noise-type omit_last --random-shuffle False

Results: [screenshot attached]

mpuu00001 commented 2 months ago

Dear Authors,

May I know whether you are using a Windows or Linux system?

Thank you~

zhoubenjia commented 2 months ago

Some of my environment info is as follows:

59~20.04.1-Ubuntu SMP Thu Jun 16 21:21:28 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
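When comparing environments like the one above across machines, a small stdlib-only helper (hypothetical, not part of this repository) can gather the details most relevant to this issue in one place:

```python
import multiprocessing as mp
import platform
import sys


def environment_summary():
    """Collect the details most relevant to DataLoader worker pickling
    issues: OS/kernel string, Python version, and the default
    multiprocessing start method."""
    return {
        "os": platform.platform(),
        "python": sys.version.split()[0],
        "start_method": mp.get_start_method(),
    }
```

Running this on both machines and diffing the output is a quick way to spot the configuration difference causing the divergent behaviour.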