Thanks for sharing your implementation for this awesome work!
I was trying to reproduce some of the experiment results and found that my training process was very slow. Training hangs for several seconds every few iterations and then resumes. I tried tuning "num_workers", which didn't help in general, and reducing or increasing the number of GPUs doesn't resolve the issue either. When I monitored GPU usage, I found that while training hangs, utilization is 0% on all GPUs. I suspect there is something wrong with the dataloader, but I cannot pin down the cause. Any comment is appreciated!
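A minimal timing sketch like the following could help confirm whether the stall is on the data-loading side (DummyDataset is just a hypothetical stand-in for the repo's actual feature dataset; the batch size and worker count are arbitrary):

```python
import time
import torch
from torch.utils.data import DataLoader, Dataset

# Hypothetical stand-in for the repo's dataset class; replace with the
# actual dataset that reads the CLIP features.
class DummyDataset(Dataset):
    def __len__(self):
        return 10000
    def __getitem__(self, idx):
        return torch.randn(36, 2048)

if __name__ == "__main__":
    loader = DataLoader(DummyDataset(), batch_size=64, num_workers=4)

    # Time each batch fetch; long gaps here (while the GPUs sit idle)
    # point at the dataloader / feature-reading path rather than the model.
    prev = time.time()
    for i, batch in enumerate(loader):
        now = time.time()
        if now - prev > 1.0:
            print(f"batch {i}: waited {now - prev:.2f}s for data")
        prev = now
        if i == 200:
            break
```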
Environment:
Ubuntu 18.04
Single node with 8 Tesla V100 GPUs
PyTorch 1.13
CUDA 11.6
Sample command I used for running the experiment:

```bash
cd VL-T5/
bash scripts/image/multiple_adapters.sh 8
```
Edit:
It turns out the bottleneck is reading the CLIP features, which were saved in HDF5 format. Converting all CLIP features to ".npy" format gives a ~7x speedup in data loading.
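For anyone hitting the same issue, here is a minimal sketch of the kind of conversion I mean (the paths and the per-image-id HDF5 layout are assumptions; adjust them to wherever your CLIP features live):

```python
import os
import h5py
import numpy as np

# Hypothetical paths: point these at your own feature file / output folder.
h5_path = "data/clip_features.h5"
out_dir = "data/clip_features_npy"
os.makedirs(out_dir, exist_ok=True)

with h5py.File(h5_path, "r") as f:
    # Assumes one HDF5 dataset per image id at the top level.
    for image_id in f.keys():
        feat = f[image_id][()]  # load the full feature array into memory
        np.save(os.path.join(out_dir, f"{image_id}.npy"), feat)
```

In the dataset's __getitem__, an np.load on the per-image .npy file then replaces the per-item h5py read, which avoids repeated HDF5 decompression and lock contention across dataloader workers.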