xingyizhou / CenterNet2

Two-stage CenterNet
Apache License 2.0

CUDA memory usage continuously increases #77


vlfom commented 2 years ago

Dear authors,

Thank you for the great work and clean code.

I am using the CenterNet2 default configuration (from Base-CenterNet2.yaml); however, during training I observe that the memory reserved by CUDA keeps increasing until training fails with a CUDA OOM error. When I replace the CenterNet2 proposal generator with the default RPN, the issue disappears.
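Roughly, the comparison looks like this (a sketch only; the config path and the `add_centernet_config` import follow this repo's layout, and other keys may also need adjusting when swapping the proposal generator):

```python
from detectron2.config import get_cfg
from centernet.config import add_centernet_config  # CenterNet2-specific config keys

# Sketch of the two setups being compared; adjust the path to your checkout.
cfg = get_cfg()
add_centernet_config(cfg)
cfg.merge_from_file("configs/Base-CenterNet2.yaml")

# With the CenterNet proposal generator (the default in Base-CenterNet2.yaml),
# reserved CUDA memory keeps growing; switching to detectron2's RPN does not leak.
cfg.MODEL.PROPOSAL_GENERATOR.NAME = "RPN"
```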

I tried adding gc.collect() and torch.cuda.empty_cache() to the training loop with no success.
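Concretely, the attempt looked roughly like this, implemented as a detectron2 hook (the class name is mine):

```python
import gc

import torch
from detectron2.engine import HookBase


class EmptyCacheHook(HookBase):
    """Free Python garbage and cached CUDA blocks after every iteration.

    This is what I tried; it did not stop the reserved memory from growing.
    """

    def after_step(self):
        gc.collect()
        torch.cuda.empty_cache()


# Registered on the trainer, e.g.:
# trainer.register_hooks([EmptyCacheHook()])
```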

Have you noticed such behavior in the past, or could you please provide some hints on what could be the issue? Below I also provide some reference screenshots.

Note: in my project, a few things differ from the configuration mentioned above: I train on 50% of the COCO dataset and I use LazyConfig to initialize the model. However, I reimplemented the configuration twice and both implementations face the same issue, so it is unlikely there is a bug in my code.
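The LazyConfig side of the setup follows the standard detectron2 pattern; the file name below is illustrative, not the actual config from my project:

```python
from detectron2.config import LazyConfig, instantiate

# Illustrative: load a lazy (Python-based) config and build the model from it.
cfg = LazyConfig.load("configs/centernet2_lazy.py")
model = instantiate(cfg.model)
```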

[two screenshots: GPU memory usage plotted over training iterations]

(observe that memory allocation keeps increasing in both screenshots)
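In case anyone wants to reproduce the measurement, the same counters can be tracked per iteration with the standard PyTorch calls (a minimal sketch, not necessarily how the screenshots above were generated):

```python
import torch

# Log once per training iteration to watch allocated vs. reserved CUDA memory grow.
allocated_mb = torch.cuda.memory_allocated() / 2**20
reserved_mb = torch.cuda.memory_reserved() / 2**20
print(f"allocated={allocated_mb:.0f} MiB  reserved={reserved_mb:.0f} MiB")
```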

costapt commented 2 years ago

Hi!

I am facing the same issue. I tried replacing the CustomCascadeROIHeads with the StandardROIHeads to check whether the problem was there, but the same problem persists. I have the feeling that the problem is in CenterNet, but I still was not able to pinpoint where.

kachiO commented 2 years ago

I've encountered this issue as well. It seems to happen with the two-stage CenterNet2 models. The workaround that I've found is running the model with the following versions: detectron2=v0.6, pytorch=1.8.1, python=3.6, and cuda=11.1
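A quick way to confirm your runtime matches that combination (both detectron2 and PyTorch expose their version strings in Python):

```python
import sys

import detectron2
import torch

# Sanity-check against the combination that avoids the leak:
# detectron2 0.6, PyTorch 1.8.1, Python 3.6, CUDA 11.1.
print("python:    ", sys.version.split()[0])
print("pytorch:   ", torch.__version__)
print("cuda:      ", torch.version.cuda)
print("detectron2:", detectron2.__version__)
```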

costapt commented 2 years ago

Thank you! 👍 It seems to have solved the problem here as well!