yrcong / RelTR

RelTR: Relation Transformer for Scene Graph Generation: https://arxiv.org/abs/2201.11460v2

Unable to train on a single GPU: torch.distributed.elastic.multiprocessing.errors.ChildFailedError #55

Closed. ikram-md closed this issue 4 months ago

ikram-md commented 4 months ago

Hello, I am trying to train the model on Visual Genome on a single GPU. About the GPU:

Since I only want to train on the one GPU I currently have, I am running the following command:

python -m torch.distributed.launch --nproc_per_node=1 --use_env main.py --dataset vg --img_folder data/vg/images/ --ann_path data/vg/ --batch_size 1 --output_dir ckpt

And I am getting the following error (posted as two screenshots, not reproduced here):

I tried reducing the batch size to 1 (one image per GPU) and setting the number of processes per node to 1 as well, but I still get the same error. I tried reading about the error, but a fix is hard to find because the cause depends on the case.

ikram-md commented 4 months ago

Solved the problem by starting a new virtual environment, installing CUDA and PyTorch by following the official "Start Locally" documentation, then installing Python 3.9, Cython, and all the necessary dependencies. The caveat is that the CUDA version must be exactly compatible with your GPU.
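For anyone checking their own setup, here is a minimal sketch of how the installed PyTorch build and the GPU can be inspected from Python. The helper name `describe_environment` is hypothetical and not part of RelTR; it only uses standard `torch` attributes, and it does not say which versions are correct for a given GPU.

```python
import platform


def describe_environment():
    """Collect the Python, PyTorch and CUDA details relevant to a version mismatch."""
    info = {"python": platform.python_version()}
    try:
        import torch
    except (ImportError, OSError):
        # PyTorch is missing or its native libraries failed to load.
        info["torch"] = None
        return info

    info["torch"] = torch.__version__
    # CUDA version this PyTorch build was compiled against (None for CPU-only builds).
    info["cuda_build"] = torch.version.cuda
    # True only if the driver and the build can actually use a GPU.
    info["cuda_available"] = torch.cuda.is_available()
    if info["cuda_available"]:
        info["gpu"] = torch.cuda.get_device_name(0)
        info["compute_capability"] = torch.cuda.get_device_capability(0)
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

If `cuda_available` comes back `False`, or `cuda_build` is `None`, the environment is probably not using the GPU at all, which would be worth ruling out before looking at the training script itself.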

wuzhiwei2001 commented 4 weeks ago

> Solved the problem by starting a new virtual environment, installing CUDA and PyTorch by following the official "Start Locally" documentation, then installing Python 3.9, Cython, and all the necessary dependencies. The caveat is that the CUDA version must be exactly compatible with your GPU.

Hello, I encountered the same issue as you did and followed your instructions for installation thoroughly. However, the problem still persists. Could you please let me know which versions of PyTorch and CUDA you selected? I would greatly appreciate your response!