youngwanLEE / vovnet-detectron2

[CVPR 2020] VoVNet backbone networks for detectron2

Out of memory error - how to reduce batch size? #11

Open 9thDimension opened 4 years ago

9thDimension commented 4 years ago

I'm trying to train a small net on my own dataset, on an AWS P2 machine with ~12 GB of GPU memory.

I'm getting the error below. Is there something I can do, such as reducing the batch size? How would I do that?

[05/16 14:46:32 d2.data.build]: Using training sampler TrainingSampler
[05/16 14:46:32 fvcore.common.checkpoint]: Loading checkpoint from https://www.dropbox.com/s/rptgw6stppbiw1u/vovnet19_ese_detectron2.pth?dl=1
[05/16 14:46:32 fvcore.common.file_io]: URL https://www.dropbox.com/s/rptgw6stppbiw1u/vovnet19_ese_detectron2.pth?dl=1 cached in /home/ubuntu/.torch/fvcore_cache/s/rptgw6stppbiw1u/vovnet19_ese_detectron2.pth?dl=1
[05/16 14:46:33 fvcore.common.checkpoint]: Some model parameters or buffers are not in the checkpoint:
  backbone.fpn_output5.{bias, weight}
  roi_heads.box_head.fc1.{bias, weight}
  roi_heads.box_predictor.bbox_pred.{weight, bias}
  roi_heads.mask_head.mask_fcn3.{weight, bias}
  roi_heads.mask_head.predictor.{bias, weight}
  backbone.fpn_output4.{bias, weight}
  backbone.fpn_output3.{weight, bias}
  proposal_generator.anchor_generator.cell_anchors.{0, 2, 3, 4, 1}
  proposal_generator.rpn_head.conv.{weight, bias}
  roi_heads.box_predictor.cls_score.{bias, weight}
  proposal_generator.rpn_head.objectness_logits.{bias, weight}
  roi_heads.mask_head.deconv.{bias, weight}
  roi_heads.box_head.fc2.{bias, weight}
  proposal_generator.rpn_head.anchor_deltas.{weight, bias}
  roi_heads.mask_head.mask_fcn1.{weight, bias}
  roi_heads.mask_head.mask_fcn2.{weight, bias}
  backbone.fpn_output2.{bias, weight}
  roi_heads.mask_head.mask_fcn4.{bias, weight}
  backbone.fpn_lateral2.{bias, weight}
  backbone.fpn_lateral4.{weight, bias}
  backbone.fpn_lateral5.{weight, bias}
  backbone.fpn_lateral3.{weight, bias}
[05/16 14:46:33 fvcore.common.checkpoint]: The checkpoint state_dict contains keys that are not used by the model:
  backbone.bottom_up.stem.stem_1/norm.num_batches_tracked
  backbone.bottom_up.stem.stem_2/norm.num_batches_tracked
  backbone.bottom_up.stem.stem_3/norm.num_batches_tracked
  backbone.bottom_up.stage2.OSA2_1.layers.0.OSA2_1_0/norm.num_batches_tracked
  backbone.bottom_up.stage2.OSA2_1.layers.1.OSA2_1_1/norm.num_batches_tracked
  backbone.bottom_up.stage2.OSA2_1.layers.2.OSA2_1_2/norm.num_batches_tracked
  backbone.bottom_up.stage2.OSA2_1.concat.OSA2_1_concat/norm.num_batches_tracked
  backbone.bottom_up.stage3.OSA3_1.layers.0.OSA3_1_0/norm.num_batches_tracked
  backbone.bottom_up.stage3.OSA3_1.layers.1.OSA3_1_1/norm.num_batches_tracked
  backbone.bottom_up.stage3.OSA3_1.layers.2.OSA3_1_2/norm.num_batches_tracked
  backbone.bottom_up.stage3.OSA3_1.concat.OSA3_1_concat/norm.num_batches_tracked
  backbone.bottom_up.stage4.OSA4_1.layers.0.OSA4_1_0/norm.num_batches_tracked
  backbone.bottom_up.stage4.OSA4_1.layers.1.OSA4_1_1/norm.num_batches_tracked
  backbone.bottom_up.stage4.OSA4_1.layers.2.OSA4_1_2/norm.num_batches_tracked
  backbone.bottom_up.stage4.OSA4_1.concat.OSA4_1_concat/norm.num_batches_tracked
  backbone.bottom_up.stage5.OSA5_1.layers.0.OSA5_1_0/norm.num_batches_tracked
  backbone.bottom_up.stage5.OSA5_1.layers.1.OSA5_1_1/norm.num_batches_tracked
  backbone.bottom_up.stage5.OSA5_1.layers.2.OSA5_1_2/norm.num_batches_tracked
  backbone.bottom_up.stage5.OSA5_1.concat.OSA5_1_concat/norm.num_batches_tracked
[05/16 14:46:33 d2.engine.train_loop]: Starting training from iteration 0
ERROR [05/16 14:46:38 d2.engine.train_loop]: Exception during training:
Traceback (most recent call last):
  File "/home/ubuntu/detectron2/detectron2/engine/train_loop.py", line 132, in train
    self.run_step()
  File "/home/ubuntu/detectron2/detectron2/engine/train_loop.py", line 215, in run_step
    loss_dict = self.model(data)
  File "/home/ubuntu/virtualenvs/detectron_env_2/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/detectron2/detectron2/modeling/meta_arch/rcnn.py", line 121, in forward
    features = self.backbone(images.tensor)
  File "/home/ubuntu/virtualenvs/detectron_env_2/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/detectron2/detectron2/modeling/backbone/fpn.py", line 123, in forward
    bottom_up_features = self.bottom_up(x)
  File "/home/ubuntu/virtualenvs/detectron_env_2/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/detectron2/projects/vovnet-detectron2/vovnet/vovnet.py", line 367, in forward
    x = getattr(self, name)(x)
  File "/home/ubuntu/virtualenvs/detectron_env_2/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/virtualenvs/detectron_env_2/lib/python3.6/site-packages/torch/nn/modules/container.py", line 100, in forward
    input = module(input)
  File "/home/ubuntu/virtualenvs/detectron_env_2/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/detectron2/projects/vovnet-detectron2/vovnet/vovnet.py", line 234, in forward
    xt = self.concat(x)
  File "/home/ubuntu/virtualenvs/detectron_env_2/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/virtualenvs/detectron_env_2/lib/python3.6/site-packages/torch/nn/modules/container.py", line 100, in forward
    input = module(input)
  File "/home/ubuntu/virtualenvs/detectron_env_2/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/detectron2/detectron2/layers/batch_norm.py", line 55, in forward
    return x * scale + bias
RuntimeError: CUDA out of memory. Tried to allocate 1.03 GiB (GPU 0; 11.17 GiB total capacity; 8.48 GiB already allocated; 845.31 MiB free; 10.03 GiB reserved in total by PyTorch)
[05/16 14:46:38 d2.engine.hooks]: Total training time: 0:00:05 (0:00:00 on hooks)
Traceback (most recent call last):
  File "train_net_docs.py", line 115, in <module>
    dist_url=args.dist_url,
  File "/home/ubuntu/detectron2/detectron2/engine/launch.py", line 57, in launch
    main_func(*args)
  File "train_net_docs.py", line 93, in main
    trainer.resume_or_load(resume=args.resume)
  File "/home/ubuntu/detectron2/detectron2/engine/defaults.py", line 401, in train
    super().train(self.start_iter, self.max_iter)
  File "/home/ubuntu/detectron2/detectron2/engine/train_loop.py", line 132, in train
    self.run_step()
  File "/home/ubuntu/detectron2/detectron2/engine/train_loop.py", line 215, in run_step
    loss_dict = self.model(data)
  File "/home/ubuntu/virtualenvs/detectron_env_2/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/detectron2/detectron2/modeling/meta_arch/rcnn.py", line 121, in forward
    features = self.backbone(images.tensor)
  File "/home/ubuntu/virtualenvs/detectron_env_2/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/detectron2/detectron2/modeling/backbone/fpn.py", line 123, in forward
    bottom_up_features = self.bottom_up(x)
  File "/home/ubuntu/virtualenvs/detectron_env_2/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/detectron2/projects/vovnet-detectron2/vovnet/vovnet.py", line 367, in forward
    x = getattr(self, name)(x)
  File "/home/ubuntu/virtualenvs/detectron_env_2/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/virtualenvs/detectron_env_2/lib/python3.6/site-packages/torch/nn/modules/container.py", line 100, in forward
    input = module(input)
  File "/home/ubuntu/virtualenvs/detectron_env_2/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/detectron2/projects/vovnet-detectron2/vovnet/vovnet.py", line 234, in forward
    xt = self.concat(x)
  File "/home/ubuntu/virtualenvs/detectron_env_2/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/virtualenvs/detectron_env_2/lib/python3.6/site-packages/torch/nn/modules/container.py", line 100, in forward
    input = module(input)
  File "/home/ubuntu/virtualenvs/detectron_env_2/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/detectron2/detectron2/layers/batch_norm.py", line 55, in forward
    return x * scale + bias
RuntimeError: CUDA out of memory. Tried to allocate 1.03 GiB (GPU 0; 11.17 GiB total capacity; 8.48 GiB already allocated; 845.31 MiB free; 10.03 GiB reserved in total by PyTorch)
9thDimension commented 4 years ago

In setup(), setting cfg.SOLVER.IMS_PER_BATCH = 8 made training run.
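
For anyone landing here later, a minimal sketch of where that override typically goes, assuming a standard detectron2 training script with a setup() helper (the linear LR scaling shown is a common convention, not something stated in this thread):

```python
from detectron2.config import get_cfg
from detectron2.engine import default_setup

def setup(args):
    cfg = get_cfg()
    cfg.merge_from_file(args.config_file)  # e.g. a VoVNet config from this repo
    cfg.merge_from_list(args.opts)
    # Reduce the number of images per training batch to fit in ~12 GB of GPU
    # memory. IMS_PER_BATCH is the *total* batch size across all GPUs.
    cfg.SOLVER.IMS_PER_BATCH = 8
    # When lowering the batch size, the base LR is often scaled down
    # proportionally (linear scaling rule): halving 16 -> 8 images would
    # halve the config's default LR as well.
    cfg.SOLVER.BASE_LR = cfg.SOLVER.BASE_LR * 8 / 16
    cfg.freeze()
    default_setup(cfg, args)
    return cfg
```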

sushilkhadkaanon commented 1 year ago

@9thDimension Training works fine, but when it comes to testing/evaluation it tries to allocate approx. 4 GB. How do I reduce the batch size for testing?
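
A hedged note, not from the thread: in stock detectron2 the test loader already feeds one image per batch (build_detection_test_loader defaults to batch size 1), so evaluation OOMs usually come from test-time image resolution rather than batch size. A sketch of the config knobs that typically matter, with illustrative values:

```python
from detectron2.config import get_cfg

cfg = get_cfg()
# ... merge your training config here, as in setup() above ...

# Evaluation memory is driven mostly by the test-time image size. Lowering
# these (the values below are just examples) shrinks per-image activations:
cfg.INPUT.MIN_SIZE_TEST = 600   # default is 800
cfg.INPUT.MAX_SIZE_TEST = 1000  # default is 1333

# Fewer final detections per image also trims post-processing memory a bit:
cfg.TEST.DETECTIONS_PER_IMAGE = 50  # default is 100
```

Note that detectron2's inference_on_dataset already runs the model under torch.no_grad(), so gradient buffers should not be the culprit during evaluation.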