yxgeee / FD-GAN

[NeurIPS-2018] FD-GAN: Pose-guided Feature Distilling GAN for Robust Person Re-identification.
https://yxgeee.github.io/projects/fdgan.html

Stage I: OOM Error when trying to train the baseline model #31

Closed: alec3010 closed this issue 5 years ago

alec3010 commented 5 years ago

Hi yxgeee,

I am running this with PyTorch 1.1 using the Python 3.6 interpreter on Ubuntu 16.04. The machine I'm using has a 1080 Ti with 11 GB of memory, so I believe it should work hardware-wise. The dataset is loaded correctly, but I get the following error when I try to train the baseline model:

```
Traceback (most recent call last):
  File "baseline.py", line 201, in <module>
    main(parser.parse_args())
  File "baseline.py", line 143, in main
    trainer.train(epoch, train_loader, optimizer, base_lr=args.lr)
  File "/home/qbiik/Alex/Algorithmen/FD-GAN/reid/trainers.py", line 32, in train
    loss, prec1 = self._forward(inputs, targets)
  File "/home/qbiik/Alex/Algorithmen/FD-GAN/reid/trainers.py", line 70, in _forward
    _, _, outputs = self.model(*inputs)
  File "/home/qbiik/Alex/venv/FD-GAN/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/qbiik/Alex/venv/FD-GAN/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 150, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/qbiik/Alex/venv/FD-GAN/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/qbiik/Alex/Algorithmen/FD-GAN/reid/models/multi_branch.py", line 13, in forward
    x1, x2 = self.base_model(x1), self.base_model(x2)
  File "/home/qbiik/Alex/venv/FD-GAN/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/qbiik/Alex/Algorithmen/FD-GAN/reid/models/resnet.py", line 69, in forward
    x = module(x)
  File "/home/qbiik/Alex/venv/FD-GAN/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/qbiik/Alex/venv/FD-GAN/lib/python3.6/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/qbiik/Alex/venv/FD-GAN/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/qbiik/Alex/venv/FD-GAN/lib/python3.6/site-packages/torchvision/models/resnet.py", line 88, in forward
    out = self.bn3(out)
  File "/home/qbiik/Alex/venv/FD-GAN/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/qbiik/Alex/venv/FD-GAN/lib/python3.6/site-packages/torch/nn/modules/batchnorm.py", line 83, in forward
    exponential_average_factor, self.eps)
  File "/home/qbiik/Alex/venv/FD-GAN/lib/python3.6/site-packages/torch/nn/functional.py", line 1697, in batch_norm
    training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 0; 10.91 GiB total capacity; 10.03 GiB already allocated; 256.94 MiB free; 20.18 MiB cached)
```

If you find the time, I'd greatly appreciate your help. :)

Best Regards

liulitianji commented 5 years ago

When using PyTorch >= 0.4.0, please wrap the inference stage in `with torch.no_grad():` before the for loop.
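
Roughly, the inference code should end up looking like this (a minimal sketch; the function and variable names below are illustrative, not the exact FD-GAN evaluation code):

```python
import torch

def extract_features(model, data_loader):
    model.eval()
    features = []
    # torch.no_grad() disables autograd bookkeeping, so intermediate
    # activations are freed right away instead of being kept around
    # for a backward pass -- this is what avoids the OOM at inference.
    with torch.no_grad():
        for imgs, _ in data_loader:
            outputs = model(imgs.cuda())
            features.append(outputs.cpu())
    return torch.cat(features)
```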

alec3010 commented 5 years ago

Hi y'all

Thanks a lot. To make it work on PyTorch 1.1.0, I also converted the loss tensor to a scalar with `tensor.item()`, i.e. `losses.update(loss.data.item(), targets.size(0))`, since PyTorch 1.1.0 no longer allows indexing a zero-dimensional loss tensor with `[0]`. (Previous formulation: `losses.update(loss.data[0], targets.size(0))`)
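
Concretely, the line in reid/trainers.py changes like this (before/after, as quoted above):

```python
# PyTorch <= 0.3.x: loss.data was a 1-element tensor, so [0] worked
losses.update(loss.data[0], targets.size(0))

# PyTorch >= 0.4 / 1.1.0: loss is a 0-dim tensor; use .item() instead
losses.update(loss.data.item(), targets.size(0))
```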

Furthermore, after putting `with torch.no_grad()` around the train function, the loss tensor needs to be re-wrapped with `loss = Variable(loss, requires_grad=True)` before `loss.backward()` is called, as `backward()` needs a tensor that requires gradients.
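
For reference, this is roughly what that part of my training step looks like now (paraphrased, not the exact FD-GAN code; `Variable` comes from `torch.autograd`):

```python
from torch.autograd import Variable

# loss was computed under torch.no_grad(), so it carries no gradient
# information; re-wrap it so that backward() can be called on it
loss = Variable(loss, requires_grad=True)

optimizer.zero_grad()
loss.backward()
optimizer.step()
```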

Maybe these insights are self-evident for AI developers who are more experienced than I am, but I thought they might make things easier for others like me, so I decided to share them.

Best Regards