Open mpuu00001 opened 3 months ago
Dear Authors,
I noticed that if I set 'args.num_workers = 0', training with train_vlp_v2.py runs without issue. If I increase 'args.num_workers' to a value greater than 0, I get the error "_pickle.UnpicklingError: state is not a dictionary" from the dataloader. Keeping 'args.num_workers = 0', however, makes training slow.
Is it because of the use of OpenCV?
Could you please provide some help?
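For context on where this message originates (a general CPython illustration, not the repository's code): the unpickler raises "state is not a dictionary" when an object's `__getstate__` returns something other than a dict and the class defines no matching `__setstate__`. A multi-worker DataLoader pickles data between processes, which is why such an object fails only when num_workers > 0:

```python
import pickle

class BadState:
    """Illustrative class whose pickled state is not a dict."""
    def __getstate__(self):
        # Returning a non-dict without defining a matching __setstate__
        # is the classic trigger for "state is not a dictionary".
        return ("not", "a", "dict")

try:
    pickle.loads(pickle.dumps(BadState()))
except pickle.UnpicklingError as e:
    print(e)  # state is not a dictionary
```

Wrapped C/C++ objects (such as some OpenCV handles held by a dataset) can hit exactly this kind of pickling edge case, so the OpenCV suspicion is plausible but unconfirmed.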
Hi, please provide your running command, it seems to be a problem in multi-process training.
Dear Authors,
Thanks for your reply and your great work! I used the following commands for the training:
torchrun --nproc_per_node=1 --master_port=1236 --master_addr=localhost train_vlp_v2.py --batch-size 4 --epochs 80 --opt sgd --lr 0.01 --output_dir out/vlp_v2 --training-refurbish True --noise-rate 0.15 --noise-type omit_last --random-shuffle False
torchrun --nproc_per_node=1 --master_port=1236 --master_addr=localhost train_vlp.py --batch-size 6 --epochs 80 --opt sgd --lr 0.01 --output_dir out/vlp
With these commands, I can run train_vlp_v2.py and train_vlp.py without a problem when args.num_workers = 0. However, when I change the corresponding num_workers to a value greater than zero, either of the commands above fails with the dataloader error "_pickle.UnpicklingError: state is not a dictionary". Keeping args.num_workers = 0, however, makes training slow.
Could you please provide some help?
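One generic way to narrow this down before launching workers (a sketch, not part of the repository; in practice you would pass the real dataset object and one sample from it) is to round-trip the objects through pickle yourself, since that is what multi-worker loading relies on:

```python
import pickle

def pickles_cleanly(obj):
    """Return True if obj survives a pickle round-trip, else False."""
    try:
        pickle.loads(pickle.dumps(obj))
        return True
    except Exception:
        return False

# Toy stand-in for a dataset sample; with the real code you might check
# e.g. pickles_cleanly(train_data) and pickles_cleanly(train_data[0]).
print(pickles_cleanly({"frames": [1, 2, 3], "text": "hello"}))  # True
```

If the dataset or a sample fails this check, the culprit member (an open file handle, an OpenCV object, a lambda, etc.) is usually easy to spot by bisecting its attributes.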
Hi,
Unfortunately, you encountered an error that seems to be related to your environment configuration. I don't experience the same error when I use your command with num_workers=8.
torchrun --nproc_per_node=1 --master_port=1236 --master_addr=localhost train_vlp_v2.py --batch-size 4 --epochs 80 --opt sgd --lr 0.01 --output_dir out/vlp_v2 --training-refurbish True --noise-rate 0.15 --noise-type omit_last --random-shuffle False
results:
Dear Authors,
May I know whether you are using a Windows or Linux system?
Thank you~
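The OS question matters because of how worker processes start (general CPython/PyTorch behaviour, offered here as a likely explanation rather than a confirmed diagnosis): Linux defaults to fork, which shares the parent's memory, while Windows only supports spawn, which must pickle objects into each worker, so unpicklable members tend to surface only there. A quick way to inspect this on your machine:

```python
import multiprocessing as mp

# Default start method: "fork" on Linux, "spawn" on Windows (and on
# macOS since Python 3.8).
print(mp.get_start_method())
print(mp.get_all_start_methods())  # "spawn" is available on every platform
```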
Some of my environment info is as follows:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
Hi Authors,
Thank you for sharing your excellent work.
We encountered the following error when trying to train the visual encoder and text decoder using train_vlp_v2.py:
The root of the error appears to be in the dataloader used by the train_one_epoch function; the training process stops before entering the for loop:
We downloaded the raw dataset from https://www-i6.informatik.rwth-aachen.de/~koller/RWTH-PHOENIX-2014-T/
Could you please provide suggestions for resolving this error?