werner-duvaud / muzero-general

MuZero
https://github.com/werner-duvaud/muzero-general/wiki/MuZero-Documentation
MIT License

Windows: num_gpus=0 but cuda.is_available() returns True #66

Closed yffbit closed 4 years ago

yffbit commented 4 years ago

There seems to be a bug on Windows 10 with CUDA devices. torch.nn.DataParallel(model) moves the model's parameters and buffers to the GPU even if selfplay_device = 'cpu'. If you then move the model back to the CPU with model.to(torch.device('cpu')) after it is created, inference raises a RuntimeError. The relevant part of the DataParallel source:

def __init__(self, module, device_ids=None, output_device=None, dim=0):
    super(DataParallel, self).__init__()

    if not torch.cuda.is_available():
        self.module = module
        self.device_ids = []
        return

    if device_ids is None:
        device_ids = list(range(torch.cuda.device_count()))
    if output_device is None:
        output_device = device_ids[0]

    self.dim = dim
    self.module = module
    self.device_ids = list(map(lambda x: _get_device_index(x, True), device_ids))
    self.output_device = _get_device_index(output_device, True)
    self.src_device_obj = torch.device("cuda:{}".format(self.device_ids[0]))

    _check_balance(self.device_ids)

    if len(self.device_ids) == 1:
        self.module.cuda(device_ids[0])

def forward(self, *inputs, **kwargs):
    if not self.device_ids:
        return self.module(*inputs, **kwargs)

    for t in chain(self.module.parameters(), self.module.buffers()):
        if t.device != self.src_device_obj:
            raise RuntimeError("module must have its parameters and buffers "
                               "on device {} (device_ids[0]) but found one of "
                               "them on device: {}".format(self.src_device_obj, t.device))

This means that whenever CUDA is available, a DataParallel model can only run on CUDA. A similar issue: https://discuss.pytorch.org/t/does-dataparallel-matters-in-cpu-mode/7587
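One way to sidestep the check, sketched here under assumptions: only wrap the model in DataParallel when the target device is a CUDA device, and use the plain module otherwise. The helper name build_inference_model is hypothetical and is not part of muzero-general or PyTorch.

```python
import torch


def build_inference_model(model, device):
    """Return a model that is safe to call on `device`.

    Hypothetical helper: DataParallel is only used for CUDA targets,
    because its forward() insists on parameters living on device_ids[0].
    """
    device = torch.device(device)
    if device.type == "cuda":
        # DataParallel expects the module on the first CUDA device.
        return torch.nn.DataParallel(model).to(device)
    # On CPU, skip the wrapper entirely and keep the plain module.
    return model.to(device)


# CPU usage: no DataParallel involved, so no device check can fail.
cpu_model = build_inference_model(torch.nn.Linear(20, 5), "cpu")
output = cpu_model(torch.randn(100, 20))
```

This matches what the last line of the test script in this thread does by hand when it calls model_parallel.module(input) directly.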

werner-duvaud commented 4 years ago

Hi, the code has been tested with GPUs on Ubuntu and without a GPU on macOS, and self-play seems to run on the right device. What you describe should not happen, because we use Ray to manage the visible GPUs: if you set selfplay_device = 'cpu', torch should not detect any GPU on the machine for self-play, which forces DataParallel into CPU mode. On Ubuntu at least, using DataParallel without a GPU seems fine.

Are you sure that it is self-play that runs on your GPU? Can you post the precise error you are getting?

PS: MuZero and Ray on Windows are experimental for now.

yffbit commented 4 years ago

The output is:

2020-08-11 21:07:52,526 INFO resource_spec.py:212 -- Starting Ray with 4.54 GiB memory available for workers and up to 2.29 GiB for objects. You can adjust these settings with ray.init(memory=, object_store_memory=).
2020-08-11 21:07:53,988 INFO services.py:1165 -- View the Ray dashboard at localhost:8265
D:\ProgramFiles\Anaconda3\lib\site-packages\torch\storage.py:34: FutureWarning: pickle support for Storage will be removed in 1.5. Use torch.save instead
  warnings.warn("pickle support for Storage will be removed in 1.5. Use torch.save instead", FutureWarning)
2020-08-11 21:07:55,628 WARNING worker.py:1047 -- The actor or task with ID ffffffffffffffff45b95b1c0100 is infeasible and cannot currently be scheduled. It requires {CPU: 1.000000}, {GPU: 1.000000} for execution and {CPU: 1.000000}, {GPU: 1.000000} for placement, however there are no nodes in the cluster that can provide the requested resources. To resolve this issue, consider reducing the resource requests of this task or add nodes that can fit the task.
2020-08-11 21:07:55.694031: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll

Training... Run tensorboard --logdir ./results and go to http://localhost:6006/ to see in real time the training performance.

(pid=19800) D:\ProgramFiles\Anaconda3\lib\site-packages\torch\storage.py:34: FutureWarning: pickle support for Storage will be removed in 1.5. Use torch.save instead
(pid=19800)   warnings.warn("pickle support for Storage will be removed in 1.5. Use torch.save instead", FutureWarning)
2020-08-11 21:08:07,997 ERROR worker.py:987 -- Possible unhandled error from worker: ray::SelfPlay.continuous_self_play() (pid=18576, ip=)
  File "python\ray_raylet.pyx", line 446, in ray._raylet.execute_task
  File "python\ray_raylet.pyx", line 400, in ray._raylet.execute_task.function_executor
  File "D:\ProgramFiles\Anaconda3\lib\site-packages\ray\function_manager.py", line 567, in actor_method_executor
    raise e
  File "D:\ProgramFiles\Anaconda3\lib\site-packages\ray\function_manager.py", line 559, in actor_method_executor
    method_returns = method(actor, *args, **kwargs)
  File "C:\Users\yff\Desktop\muzero-general-master\self_play.py", line 51, in continuous_self_play
    0,
  File "C:\Users\yff\Desktop\muzero-general-master\self_play.py", line 153, in play_game
    True,
  File "C:\Users\yff\Desktop\muzero-general-master\self_play.py", line 295, in run
    ) = model.initial_inference(observation)
  File "C:\Users\yff\Desktop\muzero-general-master\models.py", line 161, in initial_inference
    encoded_state = self.representation(observation)
  File "C:\Users\yff\Desktop\muzero-general-master\models.py", line 123, in representation
    observation.view(observation.shape[0], -1)
  File "D:\ProgramFiles\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "D:\ProgramFiles\Anaconda3\lib\site-packages\torch\nn\parallel\data_parallel.py", line 149, in forward
    "them on device: {}".format(self.src_device_obj, t.device))
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cpu

I think the error has nothing to do with Ray. I ran a simple test without Ray, and the RuntimeError is the same.

import torch
cuda, cpu = 'cuda', 'cpu'
model = torch.nn.Linear(20,5)
input = torch.randn(100,20)
model.to(cuda)
o1 = model(input.to(cuda))
model.to(cpu)
o2 = model(input.to(cpu))
model_parallel = torch.nn.DataParallel(model)
o3 = model_parallel(input)
o4 = model_parallel(input.to(cuda))
model_parallel.to(cpu)
o5 = model_parallel(input.to(cpu))
o6 = model_parallel.module(input)

Traceback (most recent call last):
  File "D:/Python/torch_test1.py", line 13, in <module>
    o5 = model_parallel(input.to(cpu))
  File "D:\ProgramFiles\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "D:\ProgramFiles\Anaconda3\lib\site-packages\torch\nn\parallel\data_parallel.py", line 149, in forward
    "them on device: {}".format(self.src_device_obj, t.device))
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cpu

werner-duvaud commented 4 years ago

Normally I use Ray to prevent this error from happening.

It may be related to the first error, which comes from a misconfigured distribution of the GPUs between the workers. Can you share your configuration so we can fix that problem first, and then check whether you still get the second error?

werner-duvaud commented 4 years ago

Here is how we can avoid it with Ray. Can you confirm that this works for you too?

import torch
import ray

ray.init()
cuda, cpu = 'cuda', 'cpu'
model = torch.nn.Linear(20,5)
input = torch.randn(100,20)
model.to(cuda)
o1 = model(input.to(cuda))
model.to(cpu)
o2 = model(input.to(cpu))
model_parallel = torch.nn.DataParallel(model)
o3 = model_parallel(input)
o4 = model_parallel(input.to(cuda))
model_parallel.to(cpu)

#just added this
@ray.remote(num_gpus=0)
def model_parallel_cpu(input):
    model_parallel = torch.nn.DataParallel(model).to(cpu)
    return model_parallel(input)
o5 = ray.get(model_parallel_cpu.remote(input.to(cpu)))

o6 = model_parallel.module(input)

yffbit commented 4 years ago

What do you mean by configuration? Configuration for what? I am using the code as is; nothing has been modified.

werner-duvaud commented 4 years ago

OK, sorry, that was unclear. I was talking about the MuZeroConfig class in the game file in the games folder. Which default game are you trying to train, then? There is something strange, because Ray requested more than 1 GPU but the default game configurations only need 1 GPU.

yffbit commented 4 years ago

> Here is how we can avoid it with Ray. Can you confirm that this works for you too?
>
> import torch
> import ray
>
> ray.init()
> cuda, cpu = 'cuda', 'cpu'
> model = torch.nn.Linear(20,5)
> input = torch.randn(100,20)
> model.to(cuda)
> o1 = model(input.to(cuda))
> model.to(cpu)
> o2 = model(input.to(cpu))
> model_parallel = torch.nn.DataParallel(model)
> o3 = model_parallel(input)
> o4 = model_parallel(input.to(cuda))
> model_parallel.to(cpu)
>
> #just added this
> @ray.remote(num_gpus=0)
> def model_parallel_cpu(input):
>     model_parallel = torch.nn.DataParallel(model).to(cpu)
>     return model_parallel(input)
> o5 = ray.get(model_parallel_cpu.remote(input.to(cpu)))
>
> o6 = model_parallel.module(input)

It doesn't work for me. The error is the same.

Traceback (most recent call last):
  File "D:/Python/torch_test2.py", line 22, in <module>
    o5 = ray.get(model_parallel_cpu.remote(input.to(cpu)))
  File "D:\ProgramFiles\Anaconda3\lib\site-packages\ray\worker.py", line 1474, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::__main__.model_parallel_cpu() (pid=4828, ip=)
  File "python\ray_raylet.pyx", line 446, in ray._raylet.execute_task
  File "D:/Python/torch_test2.py", line 21, in model_parallel_cpu
    return model_parallel(input)
  File "D:\ProgramFiles\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "D:\ProgramFiles\Anaconda3\lib\site-packages\torch\nn\parallel\data_parallel.py", line 149, in forward
    "them on device: {}".format(self.src_device_obj, t.device))
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cpu

werner-duvaud commented 4 years ago

OK, so I think you get the error because Ray is still experimental on Windows. To confirm this, can you please run this:

import torch
import ray

ray.init()

@ray.remote(num_gpus=0)
def test():
    print(torch.cuda.is_available())

ray.get(test.remote())

It prints False on Ubuntu with Ray 0.8.6 and torch 1.6. I suspect you will get True.
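To see more of what the worker process actually sees, the check could be extended along the following lines. This is only a sketch: gpu_visibility_report is a hypothetical helper name, and it assumes that GPU hiding is done through the CUDA_VISIBLE_DEVICES environment variable, which this thread does not confirm for Windows.

```python
import os


def gpu_visibility_report():
    """Collect what the current process can see about CUDA devices.

    Hypothetical diagnostic helper; torch is imported lazily so the
    function still returns the environment part if torch is missing.
    """
    report = {"CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES")}
    try:
        import torch
    except ImportError:
        report["torch_cuda_available"] = None
        report["torch_cuda_device_count"] = None
    else:
        report["torch_cuda_available"] = torch.cuda.is_available()
        report["torch_cuda_device_count"] = torch.cuda.device_count()
    return report


print(gpu_visibility_report())
```

Calling it inside the function decorated with @ray.remote(num_gpus=0), in place of the bare print, would show whether the environment variable was set for the worker and whether torch still reports a device.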

yffbit commented 4 years ago

> Ok so I think you get the error because Ray is still experimental on windows. To confirm this, can you please run this:
>
> import torch
> import ray
>
> ray.init()
>
> @ray.remote(num_gpus=0)
> def test():
>     print(torch.cuda.is_available())
>
> ray.get(test.remote())
>
> It prints False for me. I suspect you will have True.

Yes. It prints True. My Ray version is 0.8.6.

werner-duvaud commented 4 years ago

As stated in the README, I do not expect MuZero to work (perfectly) on Windows for now. I'll wait until Ray is stable on Windows to fix these kinds of issues. You can still use Google Colab to run MuZero. (Putting every model on your GPU might also make it run.)

https://github.com/ray-project/ray/issues/9114

jefflomax commented 4 years ago

This issue is linked from Ray; closing it leaves the false impression that it works.
How much work do we think it will be to just use PyTorch DataLoader and ditch Ray?

werner-duvaud commented 4 years ago

I don't know whether hiding the GPU from PyTorch with num_gpus=0 is part of Ray's specification, so I wanted to wait for the stable version of Ray for Windows before tackling it.

Another solution in the meantime is to use WSL on Windows.

As for DataLoader, I don't really see its use here; we use Ray for multiprocessing. For the moment I'm satisfied with Ray, but torch.distributed could be a replacement if someone wants to try.
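For the "hiding the GPU" part, a possible stopgap that does not rely on Ray is to blank CUDA_VISIBLE_DEVICES in the worker process itself, before anything initializes CUDA. A minimal sketch, assuming the CPU-only worker never needs a GPU during its lifetime; hide_gpus_from_process is a hypothetical name and is not part of muzero-general or Ray.

```python
import os


def hide_gpus_from_process(env=None):
    """Blank CUDA_VISIBLE_DEVICES so CUDA libraries see no device.

    Hypothetical helper. It must run before CUDA is initialized in the
    process; changing the variable afterwards has no effect.
    Returns the previous value so the caller can restore it if needed.
    """
    env = os.environ if env is None else env
    previous = env.get("CUDA_VISIBLE_DEVICES")
    env["CUDA_VISIBLE_DEVICES"] = ""
    return previous


# Demonstrated on a plain dict so the real environment is left untouched.
example_env = {"CUDA_VISIBLE_DEVICES": "0"}
old_value = hide_gpus_from_process(example_env)
```

In a CPU-only worker it would be called with no argument as the very first statement, ahead of any torch call that touches CUDA, so that torch.cuda.is_available() returns False there and DataParallel falls back to its CPU path.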

digits122 commented 2 years ago

I got the same issue, and I don't think it's a Ray problem.

I've run this:

import torch
cuda, cpu = 'cuda', 'cpu'
model = torch.nn.Linear(20,5)
input = torch.randn(100,20)
model.to(cuda)
o1 = model(input.to(cuda))
model.to(cpu)
o2 = model(input.to(cpu))
model_parallel = torch.nn.DataParallel(model)
o3 = model_parallel(input)
o4 = model_parallel(input.to(cuda))
model_parallel.to(cpu)
o5 = model_parallel(input.to(cpu))
o6 = model_parallel.module(input)

and it returns the same error that yffbit mentioned.

o5 = model_parallel(input.to(cpu))
Traceback (most recent call last):
  File "", line 1, in <module>
  File "C:\Program Files\Python39\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Program Files\Python39\lib\site-packages\torch\nn\parallel\data_parallel.py", line 153, in forward
    raise RuntimeError("module must have its parameters and buffers "
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cpu