xingyizhou / CenterNet2

Two-stage CenterNet
Apache License 2.0

GPU memory is not fully used. #73

Closed shoxa-mir closed 2 years ago

shoxa-mir commented 2 years ago

I get a "CUDA out of memory" error even though PyTorch uses only about 60% of my GPU memory. Can you help me with this issue? My GPU is an RTX 3090 with 24 GB of VRAM, but the model only uses up to 15-16 GB.

xingyizhou commented 2 years ago

Hi, I haven't encountered this issue. Can you trace down which line of the code triggered the out-of-memory error? Also note that the memory usage printed in the training log is not the real memory usage. Please use "nvidia-smi" to check the actual memory used by the program.
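The nvidia-smi check suggested above can also be scripted. The following is a minimal sketch, not part of CenterNet2 or detectron2. It assumes `nvidia-smi` is on the PATH and prints numeric values; the helper names (`parse_memory_csv`, `query_gpu_memory`, `usage_fraction`) are hypothetical. Note that nvidia-smi reports memory used on the whole device, including other processes.

```python
import subprocess

# Standard nvidia-smi query options: one CSV line per GPU, values in MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory_csv(text):
    """Parse 'used, total' lines (MiB) into a list of (used, total) int tuples.

    Assumes every field is numeric; some devices may report non-numeric values.
    """
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def query_gpu_memory():
    """Run nvidia-smi and return per-GPU (used_mib, total_mib) tuples.

    Requires an NVIDIA driver; raises if nvidia-smi is missing or fails.
    """
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_memory_csv(result.stdout)


def usage_fraction(used_mib, total_mib):
    """Fraction of device memory currently in use."""
    return used_mib / total_mib


# Example call while training is running in another terminal:
#   for used, total in query_gpu_memory():
#       print(f"{used} / {total} MiB ({usage_fraction(used, total):.0%})")
```

Comparing this figure with the `max_mem` value in the training log shows how far the two numbers diverge on a given setup.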

shoxa-mir commented 2 years ago

Got it, thank you.

GunjanPatel10 commented 2 years ago

@Shoxa-Mir Could you help me solve this issue?

Traceback (most recent call last):
  File "train_net.py", line 236, in <module>
    launch(
  File "/home/keb-pg/anaconda3/envs/CenterNet2/lib/python3.8/site-packages/detectron2/engine/launch.py", line 82, in launch
    main_func(*args)
  File "train_net.py", line 223, in main
    do_train(cfg, model, resume=args.resume)
  File "train_net.py", line 128, in do_train
    loss_dict = model(data)

This is where I am facing an error.

shoxa-mir commented 2 years ago

@GunjanPatel10 Can you include the error message itself as well?

I would also like to know your environment information. Did you install detectron2 from facebookresearch and add the CenterNet2 folder later? I had some issues when I installed detectron2 before putting the CenterNet2 folder into the "detectron2/projects/" folder. Maybe you can try reinstalling the model.

GunjanPatel10 commented 2 years ago

@Shoxa-Mir Now I am facing a new error:

eta: 0:01:22 iter: 20 total_loss: 1.907 loss_cls_stage0: 0.1733 loss_box_reg_stage0: 0 loss_cls_stage1: 0.1186 loss_box_reg_stage1: 0 loss_cls_stage2: 0.07879 loss_box_reg_stage2: 0 loss_centernet_loc: 0.9385 loss_centernet_agn_pos: 0.4728 loss_centernet_agn_neg: 0.007027 time: 1.0148 data_time: 0.0102 lr: 0.0031102 max_mem: 2128M
Traceback (most recent call last):
  File "train_net.py", line 236, in <module>
    launch(
  File "/home/keb-pg/anaconda3/envs/CenterNet2/lib/python3.8/site-packages/detectron2/engine/launch.py", line 82, in launch
    main_func(*args)
  File "train_net.py", line 223, in main
    do_train(cfg, model, resume=args.resume)
  File "train_net.py", line 128, in do_train
    loss_dict = model(data)
  File "/home/keb-pg/anaconda3/envs/CenterNet2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/keb-pg/anaconda3/envs/CenterNet2/lib/python3.8/site-packages/detectron2/modeling/meta_arch/rcnn.py", line 154, in forward
    features = self.backbone(images.tensor)
  File "/home/keb-pg/anaconda3/envs/CenterNet2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/keb-pg/Thesis/CenterNet2-master/projects/CenterNet2/centernet/modeling/backbone/bifpn.py", line 374, in forward
    x = self.cell(x)
  File "/home/keb-pg/anaconda3/envs/CenterNet2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/keb-pg/anaconda3/envs/CenterNet2/lib/python3.8/site-packages/torch/nn/modules/container.py", line 119, in forward
    input = module(input)
  File "/home/keb-pg/anaconda3/envs/CenterNet2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/keb-pg/Thesis/CenterNet2-master/projects/CenterNet2/centernet/modeling/backbone/bifpn.py", line 275, in forward
    x = self.fnode(x)
  File "/home/keb-pg/anaconda3/envs/CenterNet2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/keb-pg/Thesis/CenterNet2-master/projects/CenterNet2/centernet/modeling/backbone/bifpn.py", line 61, in forward
    x.append(module(x))
  File "/home/keb-pg/anaconda3/envs/CenterNet2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/keb-pg/anaconda3/envs/CenterNet2/lib/python3.8/site-packages/torch/nn/modules/container.py", line 119, in forward
    input = module(input)
  File "/home/keb-pg/anaconda3/envs/CenterNet2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/keb-pg/anaconda3/envs/CenterNet2/lib/python3.8/site-packages/torch/nn/modules/container.py", line 119, in forward
    input = module(input)
  File "/home/keb-pg/anaconda3/envs/CenterNet2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/keb-pg/Thesis/CenterNet2-master/projects/CenterNet2/centernet/modeling/backbone/bifpn.py", line 89, in forward
    x = self.conv(x)
  File "/home/keb-pg/anaconda3/envs/CenterNet2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/keb-pg/anaconda3/envs/CenterNet2/lib/python3.8/site-packages/detectron2/layers/wrappers.py", line 106, in forward
    x = F.conv2d(
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.

import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([2, 160, 24, 32], dtype=torch.float, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(160, 160, kernel_size=[3, 3], padding=[1, 1], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().float()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()

ConvolutionParams
    data_type = CUDNN_DATA_FLOAT
    padding = [1, 1, 0]
    stride = [1, 1, 0]
    dilation = [1, 1, 0]
    groups = 1
    deterministic = false
    allow_tf32 = true
input: TensorDescriptor 0x564629e7fd10
    type = CUDNN_DATA_FLOAT
    nbDims = 4
    dimA = 2, 160, 24, 32,
    strideA = 122880, 768, 32, 1,
output: TensorDescriptor 0x56462a190240
    type = CUDNN_DATA_FLOAT
    nbDims = 4
    dimA = 2, 160, 24, 32,
    strideA = 122880, 768, 32, 1,
weight: FilterDescriptor 0x564629fb3660
    type = CUDNN_DATA_FLOAT
    tensor_format = CUDNN_TENSOR_NCHW
    nbDims = 4
    dimA = 160, 160, 3, 3,
Pointer addresses:
    input: 0x7f1dbd800000
    output: 0x7f1dbdaf0000
    weight: 0x7f1ee96e1000
Forward algorithm: 5

Could you please tell me where to make changes?

My environment info:

sys.platform: linux
Python: 3.8.12 (default, Oct 12 2021, 13:49:34) [GCC 7.5.0]
numpy: 1.21.3
detectron2: 0.6 @/home/keb-pg/anaconda3/envs/CenterNet2/lib/python3.8/site-packages/detectron2
Compiler: GCC 7.3
CUDA compiler: CUDA 11.1
detectron2 arch flags: 3.7, 5.0, 5.2, 6.0, 6.1, 7.0, 7.5, 8.0, 8.6
DETECTRON2_ENV_MODULE: <not set>
PyTorch: 1.8.1+cu111 @/home/keb-pg/anaconda3/envs/CenterNet2/lib/python3.8/site-packages/torch
PyTorch debug build: False
GPU available: Yes
GPU 0: Quadro P2000 (arch=6.1)
Driver version: 470.86
CUDA_HOME: /usr/local/cuda
Pillow: 9.0.0
torchvision: 0.9.1+cu111 @/home/keb-pg/anaconda3/envs/CenterNet2/lib/python3.8/site-packages/torchvision
torchvision arch flags: 3.5, 5.0, 6.0, 7.0, 7.5, 8.0, 8.6
fvcore: 0.1.5.post20211023
iopath: 0.1.9
cv2: 4.5.5

I installed detectron2 using this command, which is available in the install.md readme file:

python -m pip install detectron2 -f \
  https://dl.fbaipublicfiles.com/detectron2/wheels/cu111/torch1.8/index.html

No, I haven't added the CenterNet2 folder later, i.e. after installing detectron2.
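For scale, the tensor sizes in the cuDNN dump above can be worked out by hand. The sketch below is illustrative arithmetic only and does not diagnose the cuDNN error. It assumes 4 bytes per element (CUDNN_DATA_FLOAT) and contiguous NCHW layout; the helper names `tensor_bytes` and `contiguous_strides` are hypothetical.

```python
from math import prod

BYTES_PER_FLOAT32 = 4  # CUDNN_DATA_FLOAT is a 32-bit float


def tensor_bytes(dims, bytes_per_element=BYTES_PER_FLOAT32):
    """Storage needed for a dense tensor with the given dimensions."""
    return prod(dims) * bytes_per_element


def contiguous_strides(dims):
    """Element strides of a contiguous (row-major) tensor, outermost first."""
    strides = []
    step = 1
    for size in reversed(dims):
        strides.append(step)
        step *= size
    return tuple(reversed(strides))


activation_dims = (2, 160, 24, 32)  # dimA of input/output in the dump
weight_dims = (160, 160, 3, 3)      # dimA of the filter in the dump

for name, dims in [("activation", activation_dims), ("weight", weight_dims)]:
    mib = tensor_bytes(dims) / 2**20
    print(f"{name}: {tensor_bytes(dims)} bytes ({mib:.2f} MiB)")

# The computed strides match strideA in the dump, so the layout is contiguous.
print(contiguous_strides(activation_dims))
```

Each tensor comes out under 1 MiB, which is small next to the 2128M `max_mem` figure in the log, so the sizes of this single convolution do not by themselves explain the failure.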

shoxa-mir commented 2 years ago

Why don't you check this link

GunjanPatel10 commented 2 years ago

@Shoxa-Mir That didn't help me solve my issue. I would be thankful if you could help me.