yoctta / XPaste


Memory leak in CenterNet? #1

Open DavideA opened 1 year ago

DavideA commented 1 year ago

Hi, and thank you for releasing your code.

I am trying to replicate your results, and when training the detector I get an out of memory error. Specifically, it seems that the code logs increasing memory usage.

I think the issue might be in CenterNet. If I replace the model with a Resnet50-MaskRCNN provided by detectron2, I do not observe it.

Did you ever experience this? More generally, how many GPUs and how much memory were needed to train your models?

Z-MU-Z commented 1 year ago

Hi! I am also trying to reproduce the experimental results and have encountered the same issue. I ran the following command on four A100s:

```
bash launch.sh --config-file configs/Base-C2_L_R5021k_640b64_4x.yaml
```

Is this the baseline in the paper? The log shows `max_mem` growing from 9071M to 76358M before the run eventually crashes with an error.

Z-MU-Z commented 1 year ago

I think this link may help: https://github.com/xingyizhou/CenterNet2/issues/77

DavideA commented 1 year ago

Hi! The memory leak seems to be coming from the CenterNet code and can be fixed by using the most recent version of the CenterNet repo, or, as I did, by integrating this commit.
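I haven't traced the exact CenterNet code path, but a classic way this symptom (steadily growing `max_mem`) arises in PyTorch training loops is keeping references to loss *tensors* (e.g. appending them to a logging list) instead of plain floats, which pins each iteration's autograd graph in memory. A stdlib-only sketch of the effect, where `FakeLoss` is a hypothetical stand-in for a loss tensor, not CenterNet code:

```python
import weakref

class FakeLoss:
    """Stand-in for a loss tensor whose computation graph stays alive
    for as long as the tensor itself is referenced."""
    def __init__(self, value: float):
        self.value = value
        self.graph = bytearray(1024 * 1024)  # pretend autograd graph / activations

    def item(self) -> float:
        """Like Tensor.item(): a plain Python float, no graph attached."""
        return self.value

def train(n_iters: int, leak: bool):
    live = []     # weakrefs to count how many "graphs" remain alive
    history = []  # per-iteration loss log
    for i in range(n_iters):
        loss = FakeLoss(float(i))
        live.append(weakref.ref(loss))
        # The bug: storing the tensor keeps every graph alive.
        # The fix: store loss.item() (a float) instead.
        history.append(loss if leak else loss.item())
        del loss
    alive = sum(1 for r in live if r() is not None)
    return alive, history

leaked, _ = train(100, leak=True)   # all 100 "graphs" still resident
fixed, _ = train(100, leak=False)   # all freed each iteration
```

With the buggy pattern, memory grows linearly with the number of iterations, which matches `max_mem` climbing from ~9G to ~76G over a long run; converting to a float (or calling `.detach()`) before logging releases the graph.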

I trained the base model using configs/Base-C2_L_R5021k_640b64_4x.yaml and got slightly worse results than those reported in the paper.

Z-MU-Z commented 1 year ago

Thanks for your helpful suggestions. But now I've run into a new problem. When I use `configs/Xpaste_R50.yaml` or `XPaste/configs/Xpaste_copypaste_R50.yaml` to train the model on 4 A100 GPUs, training is very slow compared to the baseline `configs/Base-C2_L_R5021k_640b64_4x.yaml`: the log shows `eta: 6 days, 0:40:49`. However, when I use `configs/Xpaste_copypaste_swinL.yaml` or `configs/Xpaste_swinL.yaml`, training is much faster: the log shows `eta: 1 day, 23:53:33`.

More strangely, when using `configs/Xpaste_R50.yaml` or `XPaste/configs/Xpaste_copypaste_R50.yaml`, the `max_mem: 17492M` shown in log.txt differs from what nvidia-smi displays:

```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.43.04    Driver Version: 515.43.04    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  Off  | 00000000:16:00.0 Off |                    0 |
| N/A   33C    P0    87W / 400W |  73357MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  Off  | 00000000:49:00.0 Off |                    0 |
| N/A   37C    P0    92W / 400W |  56595MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  Off  | 00000000:8A:00.0 Off |                    0 |
| N/A   35C    P0    85W / 400W |  27125MiB / 81920MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  Off  | 00000000:8F:00.0 Off |                    0 |
| N/A   33C    P0    71W / 400W |  75597MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
```

Have you encountered the same problem? I am looking forward to your reply.
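For what it's worth, some disagreement between the two numbers is expected: if I understand detectron2's metric printer correctly, its `max_mem` comes from `torch.cuda.max_memory_allocated()`, i.e. peak memory in live tensors, while nvidia-smi reports the whole process footprint, including PyTorch's cached-but-unused allocator pool and CUDA context overhead. A small sketch of how one might compare the two views (the `report_gpu_memory` helper is illustrative, not part of XPaste):

```python
def bytes_to_mib(n: int) -> int:
    """Convert bytes to MiB, the unit used by detectron2 logs and nvidia-smi."""
    return n // (1024 * 1024)

def report_gpu_memory(device: int = 0) -> None:
    # Local import so bytes_to_mib works without a CUDA build of PyTorch.
    import torch
    allocated = torch.cuda.max_memory_allocated(device)  # what detectron2's max_mem tracks
    reserved = torch.cuda.max_memory_reserved(device)    # allocator pool, closer to nvidia-smi
    print(f"peak allocated: {bytes_to_mib(allocated)}M, "
          f"peak reserved: {bytes_to_mib(reserved)}M")
```

`reserved` still undercounts the nvidia-smi figure slightly, since the CUDA context itself (typically several hundred MiB) is invisible to the caching allocator. A large allocated/reserved gap alone isn't necessarily a leak; steadily *growing* `max_mem`, as in the earlier runs, is.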