zhengchen1999 / DAT

PyTorch code for our ICCV 2023 paper "Dual Aggregation Transformer for Image Super-Resolution"
Apache License 2.0

DDP expects same model across all ranks #29

Open tahir0khalil opened 7 months ago

tahir0khalil commented 7 months ago

Hi,

I am trying to train the DAT model on my custom dataset and have made all the required changes in the .yml file. I have placed the data in the designated directories, but when I run the training command, it stalls for a long time after the following message is displayed: INFO: Network [DAT] is created.

Then I get the following error message and training fails. Could you let me know how I can fix this? Running the model in inference mode with the pretrained models on my custom data works perfectly.

PS: I am training on four RTX 3090 GPUs.

[Three screenshots of the error output were attached here.]

zhengchen1999 commented 7 months ago

It seems there is a problem with DDP, most likely caused by the GPU build of PyTorch. You can try creating a fresh environment and installing PyTorch with CUDA separately:

pip install torch==1.8.0+cu111 torchvision==0.9.0+cu111 torchaudio==0.8.0 -f https://download.pytorch.org/whl/torch_stable.html

Also, comment out lines 12 and 13 in requirements.txt.
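
To confirm that the reinstall took effect, it may help to print which PyTorch build is active before launching training again. The following is only a rough sketch, not part of the DAT repository: the helper names `split_build_tag` and `report_environment` are made up for illustration, and the script assumes nothing beyond the standard `torch.__version__`, `torch.version.cuda`, and `torch.cuda` queries.

```python
import importlib


def split_build_tag(version):
    """Split a version such as '1.8.0+cu111' into ('1.8.0', 'cu111').

    The tag is None when the version string has no '+...' suffix.
    """
    base, _, tag = version.partition("+")
    return base, (tag or None)


def report_environment():
    """Describe the installed PyTorch build, or return None if torch is missing.

    Hypothetical helper for illustration; it only reads version info
    and does not initialise DDP or touch any model.
    """
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        return None

    base, tag = split_build_tag(torch.__version__)
    gpu_count = torch.cuda.device_count()
    return {
        "torch": base,                        # e.g. '1.8.0'
        "build_tag": tag,                     # e.g. 'cu111'; None if untagged
        "cuda_runtime": torch.version.cuda,   # CUDA version torch was built with
        "cuda_available": torch.cuda.is_available(),
        "gpu_count": gpu_count,
        "gpu_names": [torch.cuda.get_device_name(i) for i in range(gpu_count)],
    }


if __name__ == "__main__":
    env = report_environment()
    if env is None:
        print("PyTorch is not installed in this environment.")
    else:
        for key, value in env.items():
            print(f"{key}: {value}")
```

If the install command above was used, one would expect the build tag to read `cu111` and all four GPUs to be listed. A missing tag, a different CUDA version, or fewer visible GPUs would suggest the old environment is still being picked up.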