Open lukaszkn opened 3 years ago
Disclaimer: I've only been playing around with this for a couple hours and all of this is fairly new to me, but I hope others may find it useful.
Short version:
Set reanalyse_on_gpu in MuZeroConfig to True (or to torch.cuda.is_available()). This avoids the error, because the reanalyse stage no longer tries to use a CUDA buffer on the CPU.
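A minimal sketch of the short version, assuming a config class shaped like MuZeroConfig. The class below is a hypothetical stand-in, not the project's actual class; only the attribute name reanalyse_on_gpu comes from this thread. The torch import is guarded so the snippet also runs on a machine without PyTorch.

```python
# Hypothetical stand-in for a game's MuZeroConfig; only the attribute name
# reanalyse_on_gpu is taken from this thread.
try:
    import torch

    CUDA_AVAILABLE = torch.cuda.is_available()
except ImportError:
    # Without PyTorch there is no CUDA device to use.
    CUDA_AVAILABLE = False


class MuZeroConfig:
    def __init__(self):
        # Run the reanalyse stage on the GPU only when one is usable.
        self.reanalyse_on_gpu = CUDA_AVAILABLE


config = MuZeroConfig()
print("reanalyse_on_gpu =", config.reanalyse_on_gpu)
```

Tying the flag to torch.cuda.is_available() rather than hard-coding True keeps the same config usable on a CPU-only install.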
Longer version:
You seem to be using Windows, given that the file paths use \.
Windows support is considered experimental. As far as I can tell, this is due to Ray's Windows support being incomplete (ray-project/ray#199). That is the top-level issue for tracking Windows support and, despite it being closed, support is still incomplete.
I was able to run the sample (the Connect4 game) in CPU mode. That worked because the PyTorch build I had installed lacked CUDA/GPU support, so the GPU was ignored. Since installing PyTorch with CUDA 11, I've been getting the same error as you, and I also get it when running the sample from issue #66.
For reference, my versions are ray 1.7.1 and torch 1.10.0+cu113.
I came across a way to fix the error, but I am not sure what consequences it has. Essentially, the change I made was to set reanalyse_on_gpu to True, since it seems the reanalyse stage runs on the CPU while the parameters/buffers are on the GPU.
As mentioned above, I am brand new to Ray and PyTorch. I suspect a better solution might be to transfer the data from the CUDA device to the CPU when transitioning to the reanalyse stage. Again, I have no idea whether that is optimal or even a good idea.
With the GPU, I have only seen speeds between 1 s/it and almost 2 s/it.
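The idea of moving data from the CUDA device to the CPU could be sketched as the helper below. This is an untested assumption about what such a transfer might look like, not code from the project, and whether it resolves the error in this thread is unknown. The name to_cpu is hypothetical. A value is treated as tensor-like if it has a callable .cpu() method, which torch.Tensor provides.

```python
def to_cpu(obj):
    """Return a copy of obj with every tensor-like leaf moved to the CPU.

    Assumption: a leaf is tensor-like if it exposes a callable .cpu()
    method, as torch.Tensor does. Other values are returned unchanged.
    """
    if isinstance(obj, dict):
        return {key: to_cpu(value) for key, value in obj.items()}
    if isinstance(obj, list):
        return [to_cpu(item) for item in obj]
    if isinstance(obj, tuple):
        return tuple(to_cpu(item) for item in obj)
    cpu = getattr(obj, "cpu", None)
    if callable(cpu):
        # For a torch.Tensor this returns a copy in CPU memory
        # (or the tensor itself if it is already on the CPU).
        return cpu()
    return obj
```

With PyTorch installed, a call such as to_cpu(model.state_dict()) would produce a dictionary of CPU tensors that a CPU-side worker could consume.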
Same problem. I have tried this, but it's not working; it still returns the error:
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cpu
Versions:
ray: 1.13.0, torch: 1.8.2+cu111
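Since the replies here report different ray and torch combinations, a small sketch for collecting the installed versions in one place may help when comparing setups. The function name installed_versions is hypothetical; it relies only on the standard library's importlib.metadata (Python 3.8+) and reports None for a package that is not installed.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("ray", "torch")):
    """Map each package name to its installed version, or None if absent."""
    report = {}
    for name in packages:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            report[name] = None
    return report


print(installed_versions())
```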
I have solved this problem, or rather bypassed it. For example, if you want to run connect4, open connect4.py, find def __init__(self):
and change this line
self.reanalyse_on_gpu = False
to the following
self.reanalyse_on_gpu = True
self.train_on_gpu = True
self.selfplay_on_gpu = True
and it works fine.
If you want to play another game, edit that game's .py file and set the same three config values. These parameters force everything to run on the GPU, so there won't be any CPU/GPU mismatch.
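The same three settings can be applied to any game's config with a small helper, sketched below. The helper name force_gpu and the SimpleNamespace stand-in are hypothetical; only the three attribute names come from the comment above.

```python
from types import SimpleNamespace

# Attribute names taken from the comment above.
GPU_FLAGS = ("reanalyse_on_gpu", "train_on_gpu", "selfplay_on_gpu")


def force_gpu(config, enabled=True):
    """Set all three device flags to the same value so they cannot disagree."""
    for name in GPU_FLAGS:
        setattr(config, name, enabled)
    return config


# Hypothetical stand-in for a game's config object with mixed settings.
config = SimpleNamespace(
    reanalyse_on_gpu=False, train_on_gpu=True, selfplay_on_gpu=False
)
force_gpu(config)
print({name: getattr(config, name) for name in GPU_FLAGS})
```

Passing enabled=False would instead keep every stage on the CPU, which is the other way to make the three flags agree.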
Any idea how to fix this error below? This happens for every sample game.
ray==1.5.0 torch==1.9.1+cu111
Thanks