zgojcic / Rigid3DSceneFlow

[CVPR 2021, Oral] "Weakly Supervised Learning of Rigid 3D Scene Flow"

System memory usage increase in training #12

Closed Alt216 closed 2 years ago

Alt216 commented 2 years ago

Hi, when I run python train.py ./configs/train/train_weakly_supervised.yaml to train the network from scratch on our dataset, system memory usage slowly increases until it maxes out the system memory and the training crashes. I have 16 GB of system memory, and training can only run for a little more than one epoch with ~16000 training samples. I tried lowering num_workers to 4 and the batch size to 2, but neither resolved the issue.

zgojcic commented 2 years ago

Hi,

I have never observed this issue, but it is also true that I have always had more than 16 GB of memory. Can you try to find where the memory leak is, or try training on a machine with 32 GB of RAM?
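One way to act on the "find where the memory leak is" suggestion (not part of the original thread, just a sketch) is Python's built-in tracemalloc module, which can compare heap snapshots taken before and after a few training iterations and report which source lines accumulated memory:

```python
import tracemalloc

tracemalloc.start(25)  # record up to 25 stack frames per allocation

snapshot_before = tracemalloc.take_snapshot()

# ... run a few training iterations here; the bytearray list below is
# just a stand-in for a step that leaks memory ...
leak = [bytearray(1024) for _ in range(1000)]

snapshot_after = tracemalloc.take_snapshot()

# The biggest positive size_diff entries point at the leaking lines
for stat in snapshot_after.compare_to(snapshot_before, "lineno")[:10]:
    print(stat)
```

Running this around one epoch of the training loop (instead of the stand-in allocation) should show whether the growth comes from the dataloader, the loss bookkeeping, or somewhere else.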

For the other issue that you posted: I have uploaded the preprocessing scripts that we used for semantic_kitti and the other datasets.

Alt216 commented 2 years ago

Thank you very much @zgojcic! I will look into the possible memory leaks, and thanks for the preprocessing scripts.

For the memory issue, I added .detach() on lines 145 and 151 in train.py, following a suggestion on the PyTorch forum about accumulating losses keeping the complete computation graph alive. I am still unsure whether this is the cause, but I will try running the training with this modification.
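The pitfall being described is a common PyTorch one: accumulating the loss tensor itself (rather than a detached value) keeps every iteration's autograd graph reachable, so memory grows each step. A minimal sketch, with hypothetical names rather than the repo's actual train.py code:

```python
import torch

model = torch.nn.Linear(10, 1)
running_loss = 0.0

for _ in range(5):
    x = torch.randn(4, 10)
    loss = model(x).pow(2).mean()
    loss.backward()

    # BAD:  running_loss += loss          # retains the whole graph each step
    # GOOD: detach (or use loss.item()) so only the value is accumulated
    running_loss += loss.detach()

print(running_loss.requires_grad)  # False: no graph is referenced
```

Using loss.item() instead of .detach() goes one step further and stores a plain Python float, which also avoids keeping any tensor storage around.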

The lines of code

suggestion I found

Alt216 commented 2 years ago

After some more time searching on the web, I found this, which could be a possible explanation. Maybe it has to do with the dataloader workers iterating over lists and dicts, whose per-object overhead adds up over time? The suggested solution is to replace them with numpy arrays.
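The mechanism behind that explanation: with num_workers > 0, each forked worker that reads a large Python list of small objects ends up copying its memory pages, because refcount updates dirty the copy-on-write pages. A hedged sketch of the suggested mitigation, storing the sample index as one contiguous numpy array (class and field names here are illustrative, not from the repo):

```python
import numpy as np
from torch.utils.data import Dataset

class FileListDataset(Dataset):
    """Hypothetical dataset whose index is a numpy array, not a Python list."""

    def __init__(self, paths):
        # BAD:  self.paths = paths
        #       (a list of Python strings: every worker process that touches
        #        it updates refcounts and pages get copied per worker)
        # GOOD: a single fixed-width bytes array with no per-item refcounts
        self.paths = np.array(paths, dtype=np.bytes_)

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        path = self.paths[idx].decode("utf-8")
        # ... load and return the sample stored at `path` ...
        return path
```

The same idea applies to any per-sample metadata kept in lists or dicts: packing it into numpy arrays keeps worker memory flat instead of growing with every epoch of access.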

zgojcic commented 2 years ago

Hi @Alt216, this could indeed be the case. At the moment I do not have time to investigate (especially as it works fine on machines with more RAM), but if you find a solution it would be great if you could open a PR.

Best, Zan

zgojcic commented 2 years ago

Closing due to inactivity.