xingyizhou / ExtremeNet

Bottom-up Object Detection by Grouping Extreme and Center Points
BSD 3-Clause "New" or "Revised" License

training out of memory #15

Open wenhe-jia opened 5 years ago

wenhe-jia commented 5 years ago

When I tried to train ExtremeNet on my machine, I used 5 GPUs, the same as reported in the paper. There are 8 TITAN GPUs on my machine, so I set device_ids=[0,1,2,3,4] in the DataParallel call in ExtremeNet/nnet/py_factory.py. But when I started training, I got the out-of-memory error below: (screenshot of the out-of-memory traceback, 2019-03-27). Then I tried training ExtremeNet with all 8 TITAN GPUs, setting chunk_sizes=[3,3,3,3,3,3,3,3], i.e. 3 images per GPU, and training went well. Why does this happen? It seems that memory runs out when computing the loss, and most of the memory cost falls on GPU 0.
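For context, here is a minimal sketch (not the repo's exact code) of the setup described above. ExtremeNet inherits CornerNet's custom `DataParallel`, which additionally accepts a `chunk_sizes` argument giving the number of images per GPU; the plain-PyTorch equivalent of pinning training to GPUs 0-4 looks like this, where the stand-in model and tensor shapes are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Stand-in for the ExtremeNet network; the real model is built elsewhere.
model = nn.Conv2d(3, 64, kernel_size=3, padding=1).cuda()

# Restrict replication to the first five of eight GPUs. The repo's custom
# DataParallel would additionally take something like chunk_sizes=[4, 5, 5, 5, 5],
# i.e. per-GPU image counts that must sum to the batch size.
model = nn.DataParallel(model, device_ids=[0, 1, 2, 3, 4])

x = torch.randn(24, 3, 128, 128).cuda()  # batch_size=24
y = model(x)  # the batch is scattered across the five listed GPUs
```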

xingyizhou commented 5 years ago

Hi, thanks for the report! On my machine, the memory cost of the master GPU is very close to 12GB, and on some other machines it exceeds the memory limit. I don't have a better idea other than changing chunk_sizes to [3, 5, 5, 5, 5] and batch_size to 23 ... I will let you know if I find a good way to reduce the memory cost (e.g., by rewriting the loss in C++), but I can't guarantee when that will be done or whether it will work.
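A small sketch of the suggested change, assuming the CornerNet-style config layout (a `"system"` section in `config/ExtremeNet.json` holding `batch_size` and `chunk_sizes`); the key invariant is that the per-GPU chunks sum to the batch size:

```python
# Suggested setting from the comment above: shrink the master GPU's chunk
# to 3 images, dropping the total batch size from 24 to 23.
suggested = {
    "batch_size": 23,
    "chunk_sizes": [3, 5, 5, 5, 5],  # images on GPUs 0..4; GPU 0 gets fewer
}

# chunk_sizes must partition the batch exactly, or scattering will fail.
assert sum(suggested["chunk_sizes"]) == suggested["batch_size"]
print("valid config:", suggested)
```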

wenhe-jia commented 5 years ago

@xingyizhou OK, I'll just use 8 GPUs to train ExtremeNet, keeping batch_size=24.

wenhe-jia commented 5 years ago

When I trained ExtremeNet with 4 GPUs and batch_size=12 (3 images/GPU), the memory usage of the 4 GPUs was about 12GB, 7GB, 7GB, and 7GB. That means the model with 3 images takes about 7GB of GPU memory, so I wonder what other operations are placed on GPU 0?
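A plausible answer (general `DataParallel` behavior, not verified against this repo's exact code): the master GPU additionally holds the source copy of the parameters, the buffers into which each replica's gradients are reduced, and any outputs gathered back to `device_ids[0]`, which is why it sits well above the ~7GB per-replica cost. A minimal sketch for inspecting the asymmetry with standard PyTorch calls:

```python
import torch

def report_gpu_memory() -> None:
    """Print per-GPU allocated memory; call after a forward/backward pass."""
    for i in range(torch.cuda.device_count()):
        mb = torch.cuda.memory_allocated(i) / 1024 ** 2
        print(f"GPU {i}: {mb:.0f} MB allocated")

# Called right after loss.backward() during training, this would show the
# imbalance described above: the master GPU carrying extra state on top of
# its own replica's activations.
report_gpu_memory()
```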