yuhongtian17 / Spatial-Transform-Decoupling

Problems occur when training on multiple GPUs in parallel. #4

Open lzzppp opened 8 months ago

lzzppp commented 8 months ago

Hello,

I've encountered an issue while using your project for distributed training: GPU utilization is very high, but GPU memory usage never reaches the expected amount, and the program gets stuck. I've followed the setup instructions provided in your documentation and am running the training script with the following configuration:

CUDA_VISIBLE_DEVICES=4,5,6,7 ./tools/dist_train.sh ./configs/rotated_faster_rcnn/rotated_faster_rcnn_r50_fpn_1x_dota_le90.py 4
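
For reference, a way to surface where the hang occurs is to prefix the same launch with NCCL's standard debug variables; NCCL_P2P_DISABLE=1 is a commonly reported workaround when peer-to-peer transfers misbehave on consumer GPUs. These are generic NCCL environment variables, not options of dist_train.sh:

NCCL_DEBUG=INFO NCCL_P2P_DISABLE=1 CUDA_VISIBLE_DEVICES=4,5,6,7 ./tools/dist_train.sh ./configs/rotated_faster_rcnn/rotated_faster_rcnn_r50_fpn_1x_dota_le90.py 4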

Environment:

OS: Ubuntu 22.04.2 LTS
Python version: 3.7.16
PyTorch version: 1.7.0
CUDA version: 11.0

Pip list:

Package                Version     Editable project location
---------------------- ----------- ------------------------------
addict                 2.4.0
aliyun-python-sdk-core 2.15.0
aliyun-python-sdk-kms  2.16.2
anykeystore            0.2
apex                   0.9.10.dev0
certifi                2022.12.7
cffi                   1.15.1
charset-normalizer     3.3.2
click                  8.1.7
colorama               0.4.6
crcmod                 1.7
cryptacular            1.6.2
cryptography           42.0.5
cycler                 0.11.0
Cython                 3.0.9
dataclasses            0.6
defusedxml             0.7.1
e2cnn                  0.2.3
filelock               3.12.2
flit_core              3.6.0
fonttools              4.38.0
fsspec                 2023.1.0
future                 1.0.0
GPUtil                 1.4.0
greenlet               3.0.3
huggingface-hub        0.16.4
hupper                 1.12.1
idna                   3.6
importlib-metadata     6.7.0
jmespath               0.10.0
kiwisolver             1.4.5
Markdown               3.4.4
markdown-it-py         2.2.0
MarkupSafe             2.1.5
matplotlib             3.5.3
mdurl                  0.1.2
mkl-fft                1.0.14
mkl-random             1.0.4
mkl-service            2.3.0
mmcv                   1.5.3
mmcv-full              1.6.1
mmdet                  2.25.1
mmrotate               0.3.4       /data3/chenqiyuan/lzp/mmrotate
model-index            0.1.11
mpmath                 1.3.0
numpy                  1.21.6
nvidia-ml-py           12.535.77
nvitop                 1.3.0
oauthlib               3.2.2
openai                 0.27.8
opencv-python          4.9.0.80
opendatalab            0.0.10
openmim                0.3.9
openxlab               0.0.10
ordered-set            4.1.0
oss2                   2.17.0
packaging              24.0
pandas                 1.3.5
PasteDeploy            3.1.0
pbkdf2                 1.3
Pillow                 9.3.0
pip                    22.3.1
plaster                1.1.2
plaster-pastedeploy    1.0.1
platformdirs           4.0.0
pycocotools            2.0.7
pycparser              2.21
pycryptodome           3.20.0
Pygments               2.17.2
pyparsing              3.1.2
pyramid                2.0.2
pyramid-mailer         0.15.1
python-dateutil        2.9.0.post0
python3-openid         3.2.0
pytz                   2023.4
PyYAML                 6.0.1
repoze.sendmail        4.4.1
requests               2.28.2
requests-oauthlib      2.0.0
rich                   13.7.1
safetensors            0.4.2
schedule               1.2.0
scipy                  1.7.3
setuptools             60.2.0
six                    1.16.0
SQLAlchemy             2.0.29
sympy                  1.10.1
tabulate               0.9.0
termcolor              2.3.0
terminaltables         3.1.10
timm                   0.9.12
tomli                  2.0.1
torch                  1.7.0+cu110
torchaudio             0.7.0
torchvision            0.8.1+cu110
tqdm                   4.65.2
transaction            4.0
translationstring      1.4
typing_extensions      4.7.1
urllib3                1.26.18
velruse                1.1.1
venusian               3.1.0
WebOb                  1.8.7
wheel                  0.38.4
WTForms                3.0.1
wtforms-recaptcha      0.3.2
yapf                   0.40.2
zipp                   3.15.0
zope.deprecation       5.0
zope.interface         6.2
zope.sqlalchemy        3.1

Expected Behavior: I expected training to use the GPUs' memory as configured and to proceed without getting stuck.

Actual Behavior: The training process gets stuck indefinitely, with high GPU utilization but low GPU memory usage.

Additional Context:

I have verified that training on a single GPU works normally. I also tried adjusting the DataLoader's batch size and number of workers, but the problem persists. Can you provide any insights or suggestions on how to resolve this issue? I'm wondering if there are any configuration steps I've overlooked, or if there are known compatibility issues with the given versions of PyTorch and CUDA.
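
One way to separate mmrotate from the communication layer is a bare torch.distributed all_reduce smoke test on the same four GPUs. This is only a sketch: the file name nccl_smoke_test.py is illustrative, and the --local_rank argument is what PyTorch 1.7's torch.distributed.launch passes to each worker process.

# nccl_smoke_test.py -- minimal NCCL collective test, independent of mmrotate.
# Launch with the PyTorch 1.7-era launcher:
#   CUDA_VISIBLE_DEVICES=4,5,6,7 python -m torch.distributed.launch --nproc_per_node=4 nccl_smoke_test.py
import argparse

import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # filled in by the launcher
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend="nccl", init_method="env://")

# Each rank contributes rank+1; every rank should print the same world-size sum.
x = torch.ones(1, device="cuda") * (dist.get_rank() + 1)
dist.all_reduce(x)  # if inter-GPU communication is broken, this is where it hangs
print(f"rank {dist.get_rank()}: all_reduce sum = {x.item()}")

If this hangs the same way the training run does, the problem is in the NCCL/driver/hardware stack rather than in the training code.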

Thanks for your time and help.

Best Regards, Zepeng Li

yuhongtian17 commented 8 months ago

It is indeed a serious problem, but unfortunately I cannot locate the cause for you. I have read your environment configuration and there should be no compatibility issues, unless:

  1. you are using a recent 4090 GPU, which requires CUDA version >= 11.6 (a quick check is sketched after this comment).
  2. there is some pip package conflict I am not aware of, since I use conda to manage my environments.
  3. there is firmware or hardware damage in the GPU's parallel-communication module. Please contact your server administrator or other team partners.

Finally, if you have problems configuring the official mmrotate, you can also report this issue on their GitHub.
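
A quick way to check point 1, using only standard PyTorch calls (a sketch; note that an RTX 4090 reports compute capability 8.9, while the pip list above shows a CUDA 11.0 build of PyTorch):

import torch

print("torch:", torch.__version__)           # e.g. 1.7.0+cu110
print("built with CUDA:", torch.version.cuda)
for i in range(torch.cuda.device_count()):
    name = torch.cuda.get_device_name(i)
    major, minor = torch.cuda.get_device_capability(i)  # an RTX 4090 reports (8, 9)
    print(f"GPU {i}: {name}, compute capability {major}.{minor}")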

lzzppp commented 8 months ago

Yes, I do use a 4090 GPU. Yesterday I repeatedly modified the environment, including patching some version-restriction checks in the installed packages, and now it runs. Unfortunately, I don't know exactly why.

BiangBiangH commented 4 months ago

@lzzppp @yuhongtian17 Hi, I have a question: I only have two 4090 GPUs. Are they enough for training the STD model? I am looking forward to your reply.

lzzppp commented 4 months ago

@BiangBiangH It depends on the amount of data. If the dataset is small (fewer than about one thousand images), two 4090s are enough.

BiangBiangH commented 4 months ago

@lzzppp ok thank you!