yfeng997 / MadMario

Interactive tutorial to build a learning Mario, for first-time RL learners

Running out of GPU memory after several minutes of training #7

Open ganzhi opened 3 years ago

ganzhi commented 3 years ago

Hi,

I got a CUDA out of memory error after several minutes of training. Is there a way to fix it?

(py38) C:\Src\GitHub\MadMario>python main.py
Loading model at checkpoints\2021-02-20T16-13-06\trained_mario.chkpt with exploration rate 0.1
Episode 0 - Step 660 - Epsilon 0.1 - Mean Reward 2990.0 - Mean Length 660.0 - Mean Loss 0.0 - Mean Q Value 0.0 - Time Delta 10.198 - Time 2021-02-20T16:29:03
Episode 20 - Step 5262 - Epsilon 0.1 - Mean Reward 1311.095 - Mean Length 250.571 - Mean Loss 0.0 - Mean Q Value 0.0 - Time Delta 61.936 - Time 2021-02-20T16:30:05
Episode 40 - Step 9888 - Epsilon 0.1 - Mean Reward 1149.829 - Mean Length 241.171 - Mean Loss 0.0 - Mean Q Value 0.0 - Time Delta 62.843 - Time 2021-02-20T16:31:08
Episode 60 - Step 13407 - Epsilon 0.1 - Mean Reward 1072.361 - Mean Length 219.787 - Mean Loss 0.0 - Mean Q Value 0.0 - Time Delta 47.898 - Time 2021-02-20T16:31:56
Episode 80 - Step 19197 - Epsilon 0.1 - Mean Reward 1144.407 - Mean Length 237.0 - Mean Loss 0.0 - Mean Q Value 0.0 - Time Delta 77.715 - Time 2021-02-20T16:33:14
Episode 100 - Step 22474 - Epsilon 0.1 - Mean Reward 1060.12 - Mean Length 218.14 - Mean Loss 0.0 - Mean Q Value 0.0 - Time Delta 44.237 - Time 2021-02-20T16:33:58
Episode 120 - Step 26864 - Epsilon 0.1 - Mean Reward 1015.29 - Mean Length 216.02 - Mean Loss 0.0 - Mean Q Value 0.0 - Time Delta 58.86 - Time 2021-02-20T16:34:57
Episode 140 - Step 32109 - Epsilon 0.1 - Mean Reward 1094.56 - Mean Length 222.21 - Mean Loss 0.0 - Mean Q Value 0.0 - Time Delta 71.322 - Time 2021-02-20T16:36:08
Traceback (most recent call last):
  File "main.py", line 59, in <module>
    action = mario.act(state)
  File "C:\Src\GitHub\MadMario\agent.py", line 57, in act
    state = torch.FloatTensor(state).cuda() if self.use_cuda else torch.FloatTensor(state)
RuntimeError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 10.00 GiB total capacity; 7.56 GiB already allocated; 0 bytes free; 7.74 GiB reserved in total by PyTorch)

oldschooler-dev commented 3 years ago

Hi, I also got this error when giving it a try. I tweaked the replay memory setting in the agent: on line 13 of agent.py I changed self.memory = deque(maxlen=100000) down to deque(maxlen=20000). My guess is that the cached experiences are stored as torch.FloatTensor on the GPU and are never freed, which seems intentional so they are available on the GPU during training (e.g. the memory of experiences), but it means the replay buffer eventually exhausts GPU memory. I am not 100% sure; I am still working out how to fit this onto an RTX 2080 on Windows with PyTorch 1.7.1 (CUDA). I tried the latest dev branch with the same result, so I think it comes down to not having 32 GB on the GPU. After shrinking the memory as above I am now up to 10,000 episodes. Fingers crossed; I imagine it will take another 24 hours, and I am not sure how many episodes it takes to get the results of the provided model.

oldschooler-dev commented 3 years ago

I also changed the burn-in: self.burnin = 1e4
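Put together, the two tweaks described above amount to something like the following inside the agent's constructor in agent.py. This is only a minimal sketch: the rest of the constructor is omitted, and the original default values in your checkout may differ from the ones assumed here.

    from collections import deque

    class Mario:
        def __init__(self, state_dim, action_dim, save_dir):
            # ... rest of the tutorial's constructor omitted ...

            # Replay buffer: every cached experience is kept as tensors,
            # so a smaller maxlen directly caps how much memory the buffer
            # can consume.
            self.memory = deque(maxlen=20000)  # reduced from 100000

            # Start learning after fewer warm-up steps.
            self.burnin = 1e4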

ganzhi commented 3 years ago

Created a PR to address the issue here: https://github.com/YuansongFeng/MadMario/pull/8

@oldschooler-dev, you can try my fix by cloning this repo: https://github.com/ganzhi/MadMario
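For context, a common way to avoid this kind of out-of-memory failure is to keep cached experiences on the CPU and move only the sampled mini-batch to the GPU when learning. The sketch below illustrates that general idea; the method and attribute names (cache, recall, self.memory, self.batch_size, self.use_cuda) follow the tutorial's agent, but this is an illustration of the approach, not a restatement of the linked PR.

    import random
    import torch

    class Mario:
        # ... constructor, act(), and learning code from the tutorial omitted ...

        def cache(self, state, next_state, action, reward, done):
            # Keep stored experiences on the CPU so the replay buffer does not
            # accumulate GPU allocations over thousands of steps.
            state = torch.FloatTensor(state)
            next_state = torch.FloatTensor(next_state)
            action = torch.LongTensor([action])
            reward = torch.DoubleTensor([reward])
            done = torch.BoolTensor([done])
            self.memory.append((state, next_state, action, reward, done))

        def recall(self):
            # Move only the sampled mini-batch to the GPU at learning time.
            batch = random.sample(self.memory, self.batch_size)
            state, next_state, action, reward, done = map(torch.stack, zip(*batch))
            if self.use_cuda:
                state, next_state = state.cuda(), next_state.cuda()
                action, reward, done = action.cuda(), reward.cuda(), done.cuda()
            return state, next_state, action.squeeze(), reward.squeeze(), done.squeeze()

With this layout, only the per-step tensors in act() and the batches used during learning live on the GPU, so the replay buffer size no longer has to fit in GPU memory.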

oldschooler-dev commented 3 years ago

Seems to work ok, no memory errors...Cheers

LI-SUSTech commented 3 years ago

Seems to work ok, no memory errors...Cheers

How did you fix the problem? I see no difference between @ganzhi's fork and master ...