zhou13 / lcnn

LCNN: End-to-End Wireframe Parsing

RuntimeError: CUDA out of memory in validation #9

Closed xiapengchng closed 5 years ago

xiapengchng commented 5 years ago

progress | sum | jmap | lmap | joff | lpos | lneg | speed
Running validation...
Traceback (most recent call last):
  File "./train.py", line 180, in <module>
    main()
  File "./train.py", line 172, in main
    trainer.train()
  File "/mnt/lustre/xiapengcheng/LSD/lcnn/lcnn/trainer.py", line 290, in train
    self.train_epoch()
  File "/mnt/lustre/xiapengcheng/LSD/lcnn/lcnn/trainer.py", line 202, in train_epoch
    self.validate()
  File "/mnt/lustre/xiapengcheng/LSD/lcnn/lcnn/trainer.py", line 121, in validate
    result = self.model(input_dict)
  File "/mnt/lustre/share/spring/envs/r0.3.0/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/lustre/xiapengcheng/LSD/lcnn/lcnn/models/line_vectorizer.py", line 84, in forward

I am trying to reproduce the paper's results. I am using a 1080 Ti; training works fine, but the error occurs during validation, even after I changed the batch size to 1. I have also tried a V100 (16 GB) and get the same error.

zhou13 commented 5 years ago

The out-of-memory issue is tracked in #8.

zhou13 commented 5 years ago

BTW, does the problem still happen in the current master branch? I added two commits 6 hours ago.
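For context, one generic reason validation can run out of memory even when training fits is that the validation forward pass is run with autograd enabled, so PyTorch keeps the full computation graph for every batch. The usual pattern is to put the model in eval mode and wrap the loop in torch.no_grad(). The sketch below only illustrates that common pattern; it is not a description of the two commits above, and the dictionary keys are hypothetical.

```python
import torch

def validate(model, val_loader, device="cuda"):
    # Illustrative validation loop only; not the lcnn trainer's actual code.
    model.eval()                        # disable dropout / batch-norm updates
    losses = []
    with torch.no_grad():               # no autograd graph -> far less GPU memory
        for batch in val_loader:
            batch = {k: (v.to(device) if torch.is_tensor(v) else v)
                     for k, v in batch.items()}
            out = model(batch)          # assumes a dict in / dict out interface (hypothetical)
            losses.append(out["loss"].item())  # store a Python float, not a CUDA tensor
    return sum(losses) / max(len(losses), 1)
```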

xiapengchng commented 5 years ago

Thank you for your quick reply! With the current master branch, batch_size=6 still reproduces the error, but with a batch size of 4 everything works fine. P.S.: Is it possible to train with multiple GPUs?
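If validation needs more memory per sample than training (it typically scores more line candidates per image), one workaround is to give the validation DataLoader a smaller batch size than the training one. A minimal sketch with placeholder datasets; the real lcnn dataset classes and config keys may differ:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder datasets standing in for the wireframe train/val splits (hypothetical).
train_dataset = TensorDataset(torch.randn(64, 3, 512, 512))
val_dataset = TensorDataset(torch.randn(16, 3, 512, 512))

train_loader = DataLoader(train_dataset, batch_size=4, shuffle=True,
                          num_workers=4, pin_memory=True)
# Validation often needs more memory per image, so give it a smaller batch size.
val_loader = DataLoader(val_dataset, batch_size=2, shuffle=False,
                        num_workers=4, pin_memory=True)
```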

zhou13 commented 5 years ago

It is not possible with the released version right now. We have a multi-GPU version, but it is not stable enough to release. Also, ShanghaiTech is a relatively small dataset; it only takes several hours to get a reasonable result.
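As a stopgap, the standard single-machine route in stock PyTorch is torch.nn.DataParallel, which replicates the model and splits each batch across the visible GPUs. This is only the generic mechanism, not the authors' unreleased multi-GPU version, and a model with data-dependent sampling (as in the line vectorizer) may need extra care. A minimal sketch with a toy network:

```python
import torch
import torch.nn as nn

# Toy network standing in for the lcnn model (which actually takes a dict of tensors).
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(16, 1, 1))

if torch.cuda.device_count() > 1:
    # Replicates the module on every visible GPU and scatters the batch along dim 0.
    model = nn.DataParallel(model)
model = model.cuda()

x = torch.randn(8, 3, 128, 128, device="cuda")  # the batch is split across GPUs
y = model(x)
print(y.shape)  # torch.Size([8, 1, 128, 128])
```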