Thanks for sharing your implementation for this awesome work!
I was trying to reproduce some of the experiment results and found that my training process was very slow. Training hangs for several seconds every few iterations and then resumes. I tried tuning "num_workers", which didn't help in general, and reducing or increasing the number of GPUs doesn't resolve the issue either. When I monitored GPU usage, I found that while training hangs, utilization is 0% on all GPUs. I suspect there is something wrong with the dataloader, but I cannot pin down the cause. Any comment is appreciated!
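A minimal timing sketch like the following could help confirm whether the stall is on the data-loading side (DummyDataset is just a hypothetical stand-in for the repo's actual feature dataset; the batch size and worker count are arbitrary):

```python
import time
import torch
from torch.utils.data import DataLoader, Dataset

# Hypothetical stand-in for the repo's dataset class; replace with the
# actual dataset that reads the CLIP features.
class DummyDataset(Dataset):
    def __len__(self):
        return 10000
    def __getitem__(self, idx):
        return torch.randn(36, 2048)

if __name__ == "__main__":
    loader = DataLoader(DummyDataset(), batch_size=64, num_workers=4)

    # Time each batch fetch; long gaps here (while the GPUs sit idle)
    # point at the dataloader / feature-reading path rather than the model.
    prev = time.time()
    for i, batch in enumerate(loader):
        now = time.time()
        if now - prev > 1.0:
            print(f"batch {i}: waited {now - prev:.2f}s for data")
        prev = now
        if i == 200:
            break
```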
Environment:
Ubuntu 18.04
Single node with 8 Tesla V100 GPUs
PyTorch 1.13
CUDA 11.6
Sample command I used for running the experiment:

```bash
cd VL-T5/
bash scripts/image/multiple_adapters.sh 8
```
Edit:
It turns out the bottleneck is reading the CLIP features, which were saved in HDF5 format. Converting all CLIP features to ".npy" format gives a ~7x speedup in data loading.
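For anyone hitting the same issue, here is a minimal sketch of the kind of conversion I mean (the paths and the per-image-id HDF5 layout are assumptions; adjust them to wherever your CLIP features live):

```python
import os
import h5py
import numpy as np

# Hypothetical paths: point these at your own feature file / output folder.
h5_path = "data/clip_features.h5"
out_dir = "data/clip_features_npy"
os.makedirs(out_dir, exist_ok=True)

with h5py.File(h5_path, "r") as f:
    # Assumes one HDF5 dataset per image id at the top level.
    for image_id in f.keys():
        feat = f[image_id][()]  # load the full feature array into memory
        np.save(os.path.join(out_dir, f"{image_id}.npy"), feat)
```

In the dataset's __getitem__, an np.load on the per-image .npy file then replaces the per-item h5py read, which avoids repeated HDF5 decompression and lock contention across dataloader workers.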