xiangjjj / implicit_alignment

Code for ICML2020 "Implicit Class-Conditioned Domain Alignment for Unsupervised Domain Adaptation"

About the parameter "num_workers" in Dataloader #6

Open Landmine598 opened 3 years ago

Landmine598 commented 3 years ago

As I understand it, when "self_train=True" and "source_sample_mode=True" are both set, the source domain uses the "nway_kshot_dataloader" and the target domain is sampled according to its pseudo-labels. In this setting, "num_workers" for both dataloaders is left at the default value of 0.

My assumption is that this is done to ensure the same classes are selected for both domains in each batch: with multiple workers ("num_workers" > 0), the sampling order would be disturbed and the implicit alignment would fail.
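To make my understanding concrete, here is a minimal sketch of the kind of coordination I have in mind; the class, variable, and parameter names are my own illustration, not the actual identifiers in this repository:

```python
import random
import torch
from torch.utils.data import DataLoader, Sampler, TensorDataset

class ToyNWayKShotSampler(Sampler):
    """Yield index batches with k examples from each of n classes,
    following a class schedule that is fixed in advance."""
    def __init__(self, labels, class_schedule, k_shot):
        self.class_schedule = class_schedule      # one tuple of class ids per batch
        self.k_shot = k_shot
        self.by_class = {}                        # class id -> list of dataset indices
        for idx, y in enumerate(labels):
            self.by_class.setdefault(int(y), []).append(idx)

    def __iter__(self):
        for classes in self.class_schedule:
            batch = []
            for c in classes:
                batch.extend(random.sample(self.by_class[c], self.k_shot))
            yield batch

    def __len__(self):
        return len(self.class_schedule)

# Toy tensors standing in for Office-31 features; in the real setting the
# target labels would be the current pseudo-labels.
num_classes, n_way, k_shot, num_batches = 31, 5, 4, 100
src_y = torch.arange(2000) % num_classes
tgt_y = torch.arange(2000) % num_classes
src_set = TensorDataset(torch.randn(2000, 8), src_y)
tgt_set = TensorDataset(torch.randn(2000, 8), tgt_y)

# A single precomputed schedule drives both samplers, so every batch covers
# the same classes in both domains.
schedule = [tuple(random.sample(range(num_classes), n_way)) for _ in range(num_batches)]
src_loader = DataLoader(src_set, batch_sampler=ToyNWayKShotSampler(src_y, schedule, k_shot),
                        num_workers=0)
tgt_loader = DataLoader(tgt_set, batch_sampler=ToyNWayKShotSampler(tgt_y, schedule, k_shot),
                        num_workers=0)
```

The point is that the same class schedule drives both batch samplers, so the two loaders draw from the same classes at every step.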

During training, however, I find that throughput is seriously limited by the data loading step. This is especially true when I apply the implicit alignment method to a multi-source domain adaptation task, where the data loading cost for the multiple source domains becomes unbearable.

I'm wondering whether you could suggest some solutions to this problem.

Thanks a lot, and have a nice day!

xiangjjj commented 3 years ago

I'm wondering what dataset you are referring to? I didn't notice this issue with num_workers, but I did reduce the pseudo-label update frequency to speed up the process. If the pseudo-labels are updated very frequently, the code can become very slow.
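To be concrete about what I mean by update frequency: the pseudo-labels (and the target loader built from them) are only refreshed every so many steps rather than at every iteration. A rough sketch of just that gating logic, with the expensive refresh left as a comment and purely illustrative numbers:

```python
update_freq = 20   # refresh pseudo-labels once every 20 steps
num_steps = 100    # illustrative

for step in range(num_steps):
    if step % update_freq == 0:
        # In the real code this is where the model relabels the whole target
        # set and the target n-way k-shot loader is rebuilt; that full pass is
        # the expensive part, so doing it every step makes training very slow.
        print(f"step {step}: refreshing pseudo-labels and rebuilding the target loader")
    # ... one optimization step on a source batch + a target batch goes here ...
```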

Landmine598 commented 3 years ago

> I'm wondering what dataset you are referring to? I didn't notice this issue with num_workers, but I did reduce the pseudo-label update frequency to speed up the process. If the pseudo-labels are updated very frequently, the code can become very slow.

Thank you for your timely reply!

I tried multi-source DA on Office-31, with dslr and webcam as sources and amazon as the target. Because only one worker is used ("num_workers" at its default of 0), the data for each source domain has to be loaded in sequence, and only after loading are the batches concatenated and fed to the model; this loading is very time-consuming. I tried setting "num_workers=4", which did speed up training and improved the Volatile GPU-Util, but I observed that the classes drawn for each domain in a given batch no longer matched. I suspect this is because multiple processes load data at the same time, which cannot guarantee that source samples from the same group of classes end up in the same batch, so the implicit alignment fails.

PS: I use a pseudo-label update frequency of 20, the same as in your setting.
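For reference, this is the kind of per-batch check I ran to see the mismatch, written against the toy loaders from the sketch in my first message (so again, illustrative names only):

```python
# With the two loaders built from a shared class schedule, the set of classes
# in each source batch should equal the set of (pseudo-)classes in the target
# batch drawn at the same step; any mismatch means the implicit alignment
# assumption is violated.
for step, ((xs, ys), (xt, yt)) in enumerate(zip(src_loader, tgt_loader)):
    if set(ys.tolist()) != set(yt.tolist()):
        print(f"step {step}: class mismatch "
              f"source={sorted(set(ys.tolist()))} target={sorted(set(yt.tolist()))}")
```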

xiangjjj commented 3 years ago

Thanks for the explanation. I'm wondering: do you experience slow loading when the source has a single domain? And is loading roughly 50% slower due to the one additional source domain, or is it much slower than with a single domain pair?

Landmine598 commented 3 years ago

On my GPU (2080 Ti, 11 GB), the single-source DA task (W->A) takes about 32 hours for 10000 training steps, and the multi-source task (W&D->A) takes about 46 hours. I'm wondering whether the running time for single-source DA is reasonable, or whether I haven't set the CUDA mode correctly.
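In case it is useful, this is roughly how I plan to time the loop to check whether it is data loading or the forward/backward pass that dominates; it is generic timing code around the toy loaders from the earlier sketch, nothing specific to this repo:

```python
import time
import torch

# Generic per-step timing: compare time spent waiting on the loaders against
# time spent in the training step itself. torch.cuda.synchronize() makes sure
# GPU work has actually finished before the timer is read.
data_time, compute_time = 0.0, 0.0
t_end = time.perf_counter()
for step, (source_batch, target_batch) in enumerate(zip(src_loader, tgt_loader)):
    t_data = time.perf_counter()
    data_time += t_data - t_end
    # ... forward/backward/optimizer step on the two batches goes here ...
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    t_end = time.perf_counter()
    compute_time += t_end - t_data
    if (step + 1) % 50 == 0:
        print(f"step {step + 1}: data {data_time:.1f}s, compute {compute_time:.1f}s")
```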

xiangjjj commented 3 years ago

Office-31 is a small dataset and should not take more than a couple of hours to train. There may be some issue with your training configuration.
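Aside from that, a few standard PyTorch DataLoader options are worth double-checking, since they often account for large slowdowns on small image datasets; the values below are only illustrative, and `dataset` / `batch_sampler` are placeholders for whatever you are actually using:

```python
from torch.utils.data import DataLoader

# Standard DataLoader knobs that usually dominate loading speed; the last two
# require num_workers > 0 and PyTorch >= 1.7.
loader = DataLoader(
    dataset,                      # placeholder for the actual dataset object
    batch_sampler=batch_sampler,  # e.g. the n-way k-shot batch sampler
    num_workers=4,                # parallel image decoding / augmentation
    pin_memory=True,              # faster host-to-GPU copies
    persistent_workers=True,      # keep workers alive between epochs
    prefetch_factor=2,            # batches prefetched per worker
)
```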