zackchase / mxnet-the-straight-dope

An interactive book on deep learning. Much easy, so MXNet. Wow. [Straight Dope is growing up] ---> Much of this content has been incorporated into the new Dive into Deep Learning Book available at https://d2l.ai/.
Apache License 2.0

physical memory #419

Open kazizzad opened 6 years ago

kazizzad commented 6 years ago

When I run the DQN code on an EC2 Deep Learning AMI with Python 3, htop shows slow growth in resident memory usage (RES). Over a week-long run, this slow growth can kill the job. Any solution?

I set the replay buffer to a small size in order to observe the growing memory usage faster. P.S.: forcing garbage collection with gc did not help.
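One way to produce evidence of the growth (beyond watching htop) is to log the process's resident memory over time from inside the script. A minimal sketch using only the standard library, assuming a Linux host such as the EC2 instance mentioned above (note that `ru_maxrss` reports the *peak* resident set size, which is enough to show monotonic growth):

```python
import resource
import sys


def peak_rss_mb():
    """Peak resident set size of this process, in megabytes.

    Linux reports ru_maxrss in kilobytes; macOS reports it in bytes.
    """
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if sys.platform == "darwin":
        rss //= 1024  # bytes -> kilobytes on macOS
    return rss / 1024.0


# Call this periodically inside the training loop, e.g. every N episodes:
print("peak RSS: %.1f MB" % peak_rss_mb())
```

Logging this once per episode and plotting it would show whether memory plateaus after the replay buffer fills or keeps climbing.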

zackchase commented 6 years ago

Hey @mli and @piiswrong - any idea what's going on here? @kazizzad: can you post a link to the exact notebook and perhaps produce some evidence of the memory leak? Let's figure out whether it's a bug in the notebook or in MXNet.

kazizzad commented 6 years ago

Hi @zackchase, @mli, @piiswrong, here are the links to the code: https://github.com/kazizzad/BDQN-MxNet-Gluon/blob/master/BDQN.ipynb https://github.com/zackchase/mxnet-the-straight-dope/blob/master/chapter17_deep-reinforcement-learning/DDQN.ipynb https://github.com/zackchase/mxnet-the-straight-dope/blob/master/chapter17_deep-reinforcement-learning/DQN.ipynb

For all three, whether I run them in a notebook or first convert them to scripts with "jupyter nbconvert --to script" and run them with Python, physical memory usage climbs quickly. I am using an EC2 p2.8xlarge. For a single game, memory usage can exceed 100 GB.

These programs have no data iterator, since we collect the data ourselves. We store a window of recently observed transitions in a buffer of fixed size, called the replay buffer. If you run the code, the increase in memory usage at the very beginning is due to filling the replay buffer (in CPU memory), but memory usage keeps going up even after the buffer is full.
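For context, a fixed-size replay buffer of the kind described above can be sketched with a `collections.deque` (this is an illustration of the idea, not the code from the notebooks); once the deque is full, appending evicts the oldest transition, so the buffer itself should not grow without bound:

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size FIFO store of (state, action, reward, next_state, done) transitions."""

    def __init__(self, size):
        # deque with maxlen drops the oldest entry once the buffer is full
        self.buffer = deque(maxlen=size)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # uniform sampling without replacement from the stored transitions
        return random.sample(self.buffer, batch_size)


buf = ReplayBuffer(1000)
for t in range(5000):          # insert far more transitions than the capacity
    buf.add(t, 0, 0.0, t + 1, False)
print(len(buf.buffer))         # capped at 1000
```

If the buffer is implemented this way yet resident memory still grows, the leak is likely elsewhere, e.g. in references held by the framework rather than in the buffer's own storage.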

To observe the memory leak in any of these programs, I would suggest reducing some of the hyperparameters, e.g. the replay buffer size:

```python
self.replay_buffer_size = 1000
self.Target_update = 1000
self.replay_start_size = 500
```