werner-duvaud / muzero-general

MuZero
https://github.com/werner-duvaud/muzero-general/wiki/MuZero-Documentation
MIT License

Unknown error related to ray each time on exiting the run #10

Closed: littleV closed this 4 years ago

littleV commented 4 years ago

Hi @werner-duvaud

When I run the latest code, I always get an error when exiting the process. Please see the screen output attached below:

I deleted the old repo and did a fresh checkout, but I'm still seeing this.

I also did some research and found these: https://github.com/ray-project/ray/issues/5042 and https://github.com/ray-project/ray/issues/6239. Hope they help.

Welcome to MuZero! Here's a list of games:
0. cartpole
1. connect4
2. gomoku
3. lunarlander
Enter a number to choose the game: 2

0. Train
1. Load pretrained model
2. Render some self play games
3. Play against MuZero
4. Exit
Enter a number to choose an action: 0
2020-02-25 15:21:47,620 INFO resource_spec.py:212 -- Starting Ray with 3.91 GiB memory available for workers and up to 1.97 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-02-25 15:21:47,998 INFO services.py:1093 -- View the Ray dashboard at localhost:8265

Training...
Run tensorboard --logdir ./results and go to http://localhost:6006/ to see in real time the training performance.

Done test reward: 1.00. Training step: 11/10. Played games: 1. Loss: 33.26

0. Train
1. Load pretrained model
2. Render some self play games
3. Play against MuZero
4. Exit
Enter a number to choose an action: 4
Exception ignored in: <function ActorHandle.__del__ at 0x1115289e0>
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/ray/actor.py", line 655, in __del__
AttributeError: 'NoneType' object has no attribute 'get_global_worker'
Exception ignored in: <function ActorHandle.__del__ at 0x1115289e0>
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/ray/actor.py", line 655, in __del__
AttributeError: 'NoneType' object has no attribute 'get_global_worker'
Exception ignored in: <function ActorHandle.__del__ at 0x1115289e0>
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/ray/actor.py", line 655, in __del__
AttributeError: 'NoneType' object has no attribute 'get_global_worker'
Exception ignored in: <function ActorHandle.__del__ at 0x1115289e0>
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/ray/actor.py", line 655, in __del__
AttributeError: 'NoneType' object has no attribute 'get_global_worker'
Exception ignored in: <function ActorHandle.__del__ at 0x1115289e0>
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/ray/actor.py", line 655, in __del__
AttributeError: 'NoneType' object has no attribute 'get_global_worker'
Exception ignored in: <function ActorHandle.__del__ at 0x1115289e0>
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/ray/actor.py", line 655, in __del__
AttributeError: 'NoneType' object has no attribute 'get_global_worker'
werner-duvaud commented 4 years ago

Hi,

Do you use conda? Does this happen every time and for all games?

I can't reproduce it on a fresh installation with pip on linux in a python3 virtual environment.

It seems to be related to the shutdown of Ray. Maybe replacing ray.shutdown() with:

shared_storage_worker.__ray_kill__()
replay_buffer_worker.__ray_kill__()
test_worker.__ray_kill__()
for worker in self_play_workers:
    worker.__ray_kill__()
ray.shutdown()

will solve it.
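
If your Ray version exposes ray.kill instead of __ray_kill__, the equivalent cleanup should be something like this (just a sketch, using the same worker handles as in the snippet above):

# Equivalent on Ray versions where ray.kill(actor) replaces the
# ActorHandle.__ray_kill__() method; worker handles as in the snippet above.
for worker in [shared_storage_worker, replay_buffer_worker, test_worker, *self_play_workers]:
    ray.kill(worker)
ray.shutdown()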

littleV commented 4 years ago

Hi @werner-duvaud

I have two laptops, and only one of them has the issue. I'll try your method.

Btw, I might be able to train the gomoku game on AWS; I'm working on a strategy to lower the cost.

I'm also experimenting with adding new games, so I'm trying to use abstract classes to standardize the creation of a game. Is this something you would be interested in adding to the project? I'd be using Python 3 syntax. If not, I'll drop it and just use what's available now.
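
For instance, roughly something like this (just a sketch to show the idea; the class and method names are illustrative, not taken from the repo):

from abc import ABC, abstractmethod

class AbstractGame(ABC):
    """Illustrative interface that every game wrapper would implement."""

    @abstractmethod
    def reset(self):
        """Reset the game and return the initial observation."""

    @abstractmethod
    def step(self, action):
        """Apply an action and return (observation, reward, done)."""

    @abstractmethod
    def legal_actions(self):
        """Return the list of legal actions for the current player."""

    @abstractmethod
    def render(self):
        """Display the current game state."""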

Thanks!

werner-duvaud commented 4 years ago

Hi,

About the bug, do you have an idea of the difference between your laptops that could cause it?

For AWS, that's great. Don't hesitate to ask if you need help.

We are interested in adding an abstract class, so don't hesitate to open a PR.

Thanks

littleV commented 4 years ago

Hi @werner-duvaud,

I don't have the other laptop with me now to tell you the differences. I'll update once I have the info.

Here's some more information:

  1. Tried your code; it didn't work.
  2. Uninstalled all the dependencies, deleted the repo, did a fresh checkout, and installed from requirements; for some reason I had to reinstall TensorFlow too. Still seeing the error.
  3. Wrote a simple program with Ray but was not able to reproduce the same error.

I'm new to Ray. To run this project, how do I make sure the training happens on the GPU? And do the test episodes have to happen on the GPU?

I want to set up training on a GPU machine and, when it's done, copy the model to a CPU-only machine for playing, to save resources.

werner-duvaud commented 4 years ago

Hi,

Ok, it might be interesting to revert the last commit to see if the bug is related to it.

In the __init__ of the Ray actors we specify whether the model should be on the CPU or the GPU. The training part takes place on the GPU if there is one, and the self-play part takes place on the CPU. You can check it by printing next(self.model.parameters()).is_cuda in the corresponding Ray actors.
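
As a rough illustration of the pattern (a minimal sketch with made-up names, not the repo's actual classes):

import ray
import torch

# Sketch: the actor moves its copy of the model to the device passed to
# __init__ ("cuda" for the trainer, "cpu" for self-play) and exposes the
# check mentioned above. To actually reserve a GPU, the actor would be
# declared with @ray.remote(num_gpus=1).
@ray.remote
class TrainerActor:
    def __init__(self, model, device):
        self.model = model.to(device)

    def on_gpu(self):
        return next(self.model.parameters()).is_cuda

if __name__ == "__main__":
    ray.init()
    actor = TrainerActor.remote(torch.nn.Linear(4, 2), "cpu")
    print(ray.get(actor.on_gpu.remote()))  # False: the model is on the CPU
    ray.shutdown()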

Normally the test episodes take place on the CPU since they are a kind of self-play.

If you have a GPU, the training_device parameter in the config should be "cuda" and the model will be trained on the GPU. The model is saved automatically.
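
In the game config that would look something like this (an illustrative fragment; only training_device matters here, the class name is just for the example):

import torch

class MuZeroConfig:
    def __init__(self):
        # Train on the GPU when one is available, otherwise fall back to CPU.
        self.training_device = "cuda" if torch.cuda.is_available() else "cpu"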

To use a model, just copy model.weights and load it with the "Load pretrained model" option, then render some self-play games. It should also work on CPU even if the model was trained on a GPU.
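
Under the hood this is just PyTorch loading with a CPU map_location; a generic, self-contained sketch of the idea (the tiny stand-in network is only for illustration, not the repo's actual loading code):

import torch

# Stand-in network for illustration; in the repo the real MuZero network
# would be constructed instead.
network = torch.nn.Linear(4, 2)

# Pretend these weights were saved by a training run (possibly on a GPU)...
torch.save(network.state_dict(), "model.weights")

# ...the key part on a CPU-only machine: map_location remaps any CUDA
# tensors onto the CPU at load time.
weights = torch.load("model.weights", map_location="cpu")
network.load_state_dict(weights)
network.eval()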