Open tahir0khalil opened 7 months ago
There seems to be a problem with DDP (DistributedDataParallel). This is most likely caused by the installed PyTorch build not matching your GPU/CUDA setup. Try creating a fresh environment and installing PyTorch with CUDA separately:
pip install torch==1.8.0+cu111 torchvision==0.9.0+cu111 torchaudio==0.8.0 -f https://download.pytorch.org/whl/torch_stable.html
Also, comment out lines 12 and 13 in requirements.txt.
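After reinstalling, it may help to confirm that the new environment actually picked up a CUDA-enabled build and sees all the GPUs before launching DDP training again. The snippet below is only a minimal sketch for that check, not part of the DAT repository; the function name `describe_torch_env` is made up for illustration. As far as I know, RTX 3090 cards report compute capability (8, 6) and need a PyTorch build compiled against CUDA 11.1 or newer, so a `built_with_cuda` value below 11.1 or `cuda_available: False` would point to the environment rather than the training config.

```python
def describe_torch_env():
    """Collect basic facts about the installed PyTorch build and visible GPUs.

    Hypothetical helper for illustration; it only uses public torch calls.
    """
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment at all.
        return {"torch_installed": False}

    gpu_count = torch.cuda.device_count()  # 0 when no usable CUDA device is found
    return {
        "torch_installed": True,
        "torch_version": torch.__version__,      # e.g. "1.8.0+cu111"
        "built_with_cuda": torch.version.cuda,   # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
        "gpu_count": gpu_count,
        # NCCL is the backend normally used for multi-GPU DDP training.
        "nccl_available": (
            torch.distributed.is_available()
            and torch.distributed.is_nccl_available()
        ),
        # (name, (major, minor) compute capability) for each visible GPU.
        "gpus": [
            (torch.cuda.get_device_name(i), torch.cuda.get_device_capability(i))
            for i in range(gpu_count)
        ],
    }


if __name__ == "__main__":
    for key, value in describe_torch_env().items():
        print(f"{key}: {value}")
```

On a four-GPU machine, `gpu_count` should be 4; a lower number usually means `CUDA_VISIBLE_DEVICES` is restricting the devices or the driver does not see them.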
Hi,
I am trying to train the DAT model on my custom dataset and have made all the required changes in the .yml file. I have placed the data in the designated directories, but when I run the training command, it stalls for quite a long time after the following message is displayed:
INFO: Network [DAT] is created.
Then I get the following error message and training fails. Kindly let me know how I can fix this issue. Running the model in inference mode with the pretrained models on my custom data works perfectly.
PS: I am training on four RTX 3090 GPUs.