Hi,
The code has been tested with GPUs on Ubuntu and without a GPU on macOS, and the self-play seems to run on the right device.
What you describe should not happen, because we use Ray to manage the visible GPUs: if you set `selfplay_device = 'cpu'`, torch should not detect any GPU on the machine for the self-play, which forces DataParallel into CPU mode.
On Ubuntu at least, it seems fine to use DataParallel without a GPU.
Are you sure that it is the self-play that runs on your GPU? Can you post the precise error you are having?
PS: MuZero and Ray on Windows are experimental for now.
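A minimal sketch of the GPU-hiding mechanism described above, assuming Ray's behavior of setting `CUDA_VISIBLE_DEVICES` for workers declared with `num_gpus=0`; the script emulates the worker environment and is illustrative, not the repository's code:

```python
import os

# Emulate what Ray does inside a worker declared with num_gpus=0: it sets
# CUDA_VISIBLE_DEVICES before the task runs, so CUDA reports no devices.
# This must happen before torch initializes CUDA, hence before the import.
os.environ["CUDA_VISIBLE_DEVICES"] = ""

import torch

print(torch.cuda.is_available())  # expected: False, even on a GPU machine
print(torch.cuda.device_count())  # expected: 0
```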
The output is:

```
2020-08-11 21:07:52,526 INFO resource_spec.py:212 -- Starting Ray with 4.54 GiB memory available for workers and up to 2.29 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
D:\ProgramFiles\Anaconda3\lib\site-packages\torch\storage.py:34: FutureWarning: pickle support for Storage will be removed in 1.5. Use torch.save instead
  warnings.warn("pickle support for Storage will be removed in 1.5. Use torch.save instead", FutureWarning)
2020-08-11 21:07:55,628 WARNING worker.py:1047 -- The actor or task with ID ffffffffffffffff45b95b1c0100 is infeasible and cannot currently be scheduled. It requires {CPU: 1.000000}, {GPU: 1.000000} for execution and {CPU: 1.000000}, {GPU: 1.000000} for placement, however there are no nodes in the cluster that can provide the requested resources. To resolve this issue, consider reducing the resource requests of this task or add nodes that can fit the task.
2020-08-11 21:07:55.694031: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
Training... Run tensorboard --logdir ./results and go to http://localhost:6006/ to see in real time the training performance.
(pid=19800) D:\ProgramFiles\Anaconda3\lib\site-packages\torch\storage.py:34: FutureWarning: pickle support for Storage will be removed in 1.5. Use torch.save instead
(pid=19800)   warnings.warn("pickle support for Storage will be removed in 1.5. Use torch.save instead", FutureWarning)
2020-08-11 21:08:07,997 ERROR worker.py:987 -- Possible unhandled error from worker: ray::SelfPlay.continuous_self_play() (pid=18576, ip=)
  File "python\ray\_raylet.pyx", line 446, in ray._raylet.execute_task
  File "python\ray\_raylet.pyx", line 400, in ray._raylet.execute_task.function_executor
  File "D:\ProgramFiles\Anaconda3\lib\site-packages\ray\function_manager.py", line 567, in actor_method_executor
    raise e
  File "D:\ProgramFiles\Anaconda3\lib\site-packages\ray\function_manager.py", line 559, in actor_method_executor
    method_returns = method(actor, *args, **kwargs)
  File "C:\Users\yff\Desktop\muzero-general-master\self_play.py", line 51, in continuous_self_play
    0,
  File "C:\Users\yff\Desktop\muzero-general-master\self_play.py", line 153, in play_game
    True,
  File "C:\Users\yff\Desktop\muzero-general-master\self_play.py", line 295, in run
    ) = model.initial_inference(observation)
  File "C:\Users\yff\Desktop\muzero-general-master\models.py", line 161, in initial_inference
    encoded_state = self.representation(observation)
  File "C:\Users\yff\Desktop\muzero-general-master\models.py", line 123, in representation
    observation.view(observation.shape[0], -1)
  File "D:\ProgramFiles\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "D:\ProgramFiles\Anaconda3\lib\site-packages\torch\nn\parallel\data_parallel.py", line 149, in forward
    "them on device: {}".format(self.src_device_obj, t.device))
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cpu
```
I think the error has nothing to do with Ray. I did a simple test, and the RuntimeError is the same:
```python
import torch
cuda, cpu = 'cuda', 'cpu'
model = torch.nn.Linear(20,5)
input = torch.randn(100,20)
model.to(cuda)
o1 = model(input.to(cuda))
model.to(cpu)
o2 = model(input.to(cpu))
model_parallel = torch.nn.DataParallel(model)
o3 = model_parallel(input)
o4 = model_parallel(input.to(cuda))
model_parallel.to(cpu)
o5 = model_parallel(input.to(cpu))
o6 = model_parallel.module(input)
```
```
Traceback (most recent call last):
  File "D:/Python/torch_test1.py", line 13, in <module>
```
Normally I use Ray to prevent this error from happening.
It may be related to the first error, which suggests that the distribution of the GPU between the workers is misconfigured. Can you share your configuration so we can fix this problem, and then check whether you still have the second error?
Here is how we can avoid it with Ray. Can you confirm that this works for you too?
```python
import torch
import ray

ray.init()
cuda, cpu = 'cuda', 'cpu'
model = torch.nn.Linear(20,5)
input = torch.randn(100,20)
model.to(cuda)
o1 = model(input.to(cuda))
model.to(cpu)
o2 = model(input.to(cpu))
model_parallel = torch.nn.DataParallel(model)
o3 = model_parallel(input)
o4 = model_parallel(input.to(cuda))
model_parallel.to(cpu)

# just added this
@ray.remote(num_gpus=0)
def model_parallel_cpu(input):
    model_parallel = torch.nn.DataParallel(model).to(cpu)
    return model_parallel(input)

o5 = ray.get(model_parallel_cpu.remote(input.to(cpu)))
o6 = model_parallel.module(input)
```
What do you mean by configuration? Configuration for what? I use the raw code; nothing has been modified.
OK, sorry, that was unclear: I was talking about the `MuZeroConfig` class in the game file in the games folder. Which default game are you trying to train, then?
There is something strange, because Ray requested more than one GPU, while the default game configurations only need one GPU.
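As an aside, the "infeasible" warning in the log above is what Ray prints when a task or actor asks for more resources than any node can provide. A minimal, purely illustrative way to trigger it (the actor name is hypothetical, not the repository's code):

```python
import ray
import time

ray.init()

# On a machine where Ray detects fewer than two GPUs, this actor can never
# be scheduled, and Ray periodically logs the "infeasible and cannot
# currently be scheduled" warning seen in the output above.
@ray.remote(num_cpus=1, num_gpus=2)
class SelfPlayActor:
    def run(self):
        return "running"

actor = SelfPlayActor.remote()
time.sleep(15)  # give Ray time to print the warning before the script exits
```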
> Here is how we can avoid it with Ray. Can you confirm that this works for you too?
It doesn't work for me. The error is the same:

```
Traceback (most recent call last):
  File "D:/Python/torch_test2.py", line 22, in <module>
```
OK, so I think you get the error because Ray is still experimental on Windows. To confirm this, can you please run this:
```python
import torch
import ray

ray.init()

@ray.remote(num_gpus=0)
def test():
    print(torch.cuda.is_available())

ray.get(test.remote())
```
It prints False on Ubuntu with ray 0.8.6 and torch 1.6. I suspect you will have True.
Yes. It prints True. My Ray version is 0.8.6.
As stated in the readme, I do not expect MuZero to work (perfectly) on Windows for now. I'll wait until Ray is stable on Windows to fix this kind of issue. You can still use Google Colab to run MuZero. (Also, maybe putting every model on your GPU will make it run.)
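For what it's worth, here is a minimal sketch of that last suggestion (putting every model and its inputs on the GPU), assuming a CUDA device is present; it is illustrative, not the repository's code:

```python
import torch

model = torch.nn.Linear(20, 5)

# Keep both the DataParallel wrapper and the inputs on the GPU; on the
# affected Windows setups this appears to be the only placement that
# DataParallel accepts.
model_parallel = torch.nn.DataParallel(model).to('cuda')
out = model_parallel(torch.randn(100, 20).to('cuda'))
print(out.device)  # cuda:0
```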
This issue is linked from Ray; closing it leaves the false impression that it works.
How much work do we think it will be to just use PyTorch DataLoader and ditch Ray?
I don't know whether hiding the GPU from PyTorch with `num_gpus=0` is part of Ray's specification, so I wanted to wait for the stable version of Ray on Windows before tackling it.
Another solution in the meantime is to use WSL on Windows.
Concerning DataLoader, I don't really understand its use here; we use Ray for multiprocessing. For the moment I'm satisfied with Ray, but a replacement could be `torch.distributed` if someone wants to try (see the sketch below).
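For anyone curious, here is a minimal, purely illustrative sketch of what a torch-native replacement for the Ray workers might look like, using `torch.multiprocessing` to spawn the processes (the worker function and its body are hypothetical, not the repository's code):

```python
import torch.multiprocessing as mp

def self_play_worker(rank):
    # Each process would run its own self-play loop here; this stub only
    # reports that it started. mp.spawn passes the process index as `rank`.
    print(f"self-play worker {rank} started")

if __name__ == "__main__":
    # Spawn two worker processes, roughly mirroring what Ray actors provide.
    mp.spawn(self_play_worker, nprocs=2)
```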
I got this same issue, and I don't think it's a Ray problem. I've run this:
```python
import torch
cuda, cpu = 'cuda', 'cpu'
model = torch.nn.Linear(20,5)
input = torch.randn(100,20)
model.to(cuda)
o1 = model(input.to(cuda))
model.to(cpu)
o2 = model(input.to(cpu))
model_parallel = torch.nn.DataParallel(model)
o3 = model_parallel(input)
o4 = model_parallel(input.to(cuda))
model_parallel.to(cpu)
o5 = model_parallel(input.to(cpu))
o6 = model_parallel.module(input)
```
and it returns the same error as yffbit mentioned:

```
o5 = model_parallel(input.to(cpu))
Traceback (most recent call last):
  File "
```
There seems to be a bug on Windows 10 with CUDA devices. `torch.nn.DataParallel(model)` will move the model's parameters and buffers to the GPU even if `selfplay_device = 'cpu'`. If you then move the model to the CPU with `model.to(torch.device('cpu'))` after it is created, the inference will raise a RuntimeError. It means a DataParallel model can only run on CUDA if CUDA is available. Similar issue: https://discuss.pytorch.org/t/does-dataparallel-matters-in-cpu-mode/7587
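Given that behavior, one defensive pattern that sidesteps the bug is to wrap the model in DataParallel only when CUDA is actually available, so the CPU path never goes through DataParallel at all. A minimal sketch (not the repository's code):

```python
import torch

model = torch.nn.Linear(20, 5)

# Only use DataParallel on the CUDA path; a plain module works fine on CPU,
# so the buggy DataParallel-on-CPU code path is never exercised.
if torch.cuda.is_available():
    model = torch.nn.DataParallel(model).to('cuda')
    out = model(torch.randn(100, 20).to('cuda'))
else:
    out = model(torch.randn(100, 20))

print(out.shape)  # torch.Size([100, 5])
```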