werner-duvaud / muzero-general

MuZero
https://github.com/werner-duvaud/muzero-general/wiki/MuZero-Documentation
MIT License

Self-play very slow and inefficient on GPU (self.selfplay_on_gpu = True) #159

Open kevaday opened 3 years ago

kevaday commented 3 years ago

Hello,

I've been having issues with self-play on the GPU. After about a week of experimentation, I've realized that this option is necessary if I want to train a model on my custom game in a reasonable amount of time (hours or days rather than weeks or months), given the complexity of my game (an action space of 14,641).

I've also done some extensive testing with the self.selfplay_on_gpu option set to True in the config, and I've noticed that self-play is a bewildering 2-3 times slower on GPU than on CPU. Moreover, running on GPU consumes more than twice the RAM, which also doesn't make much sense, since the networks should in theory be moved to vRAM. I have 16GB of RAM, and running on GPU only allows one worker on most tasks; otherwise all 16GB run out instantly when playing self-play games.

I added a metric measuring the time taken for each step of the game in the self-play loop, and here's an example graph of that metric from TensorBoard during a test (red is GPU, green is CPU, y-axis is time in seconds per self-play step):

(Sorry about the picture instead of screenshot)

I don't have an example screenshot of memory usage at the moment, but a common pattern was 30-40% usage on CPU and 80-90% (dangerously going into swap) on GPU, both with the same parameters and 1 worker.

My best guess is that MCTS is somehow bottlenecking the self-play process: looking through the code, MCTS runs almost entirely on the CPU except for the model's inference calls, so the search is constantly shuttling data to and from the GPU while still waiting on the CPU, which slows everything down.

Being able to train on my game faster is very important to me, because running on CPU is currently unbearable (I let it run for 24 hours and it only played 50 games, far from enough for effective learning). If you could please look into it and see if there is a possible solution, I would very much appreciate it. I love this library and it is exactly what I'm looking for, so it would be amazing if I could train within a reasonable timescale.

Here is some basic hardware information in case it helps:

CPU: i5-4690, 4 cores @ 3.5GHz (3.9GHz with turbo)
RAM: 16GB DDR3
GPU: GTX 1070 OC with 8GB vRAM

Thank you very much for your time and for making MuZero accessible to the public! -Kevaday

EDIT: I did a bit of research, and based on previous conversations in the Discord server, it seems that the problem stems from the inefficient transfer of data between the CPU and GPU. AlphaGo Zero solved this problem as described in the following article: https://web.stanford.edu/~surag/posts/alphazero.html

As described in that article, MCTS searches can in fact be parallelized by batching the leaf states and passing them all at once to the network running on the GPU, allowing for a more efficient transfer of information between the CPU and GPU, if I understood correctly. I'm not exactly sure how the current MCTS methods in the library would need to be modified to achieve this, but it would be extremely beneficial and much appreciated by many in this community, and I would be glad to help out where possible.
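To illustrate the idea (this is only a rough sketch with made-up names, not code from this repo or the real MuZero networks), the pending leaf evaluations could be stacked and run through the network in a single forward pass:

```python
import torch
import torch.nn as nn

# Toy stand-in for the real networks, only here to make the sketch runnable.
class DummyNet(nn.Module):
    def __init__(self, obs_dim=16, n_actions=4):
        super().__init__()
        self.body = nn.Linear(obs_dim, 32)
        self.value_head = nn.Linear(32, 1)
        self.policy_head = nn.Linear(32, n_actions)

    def forward(self, x):
        h = torch.relu(self.body(x))
        return self.value_head(h), self.policy_head(h)

def evaluate_leaves_batched(model, leaf_observations, device):
    # One big tensor and a single forward pass instead of one tiny
    # GPU call per MCTS leaf.
    batch = torch.stack(leaf_observations).to(device)
    with torch.no_grad():
        values, policy_logits = model(batch)
    return values.cpu(), policy_logits.cpu()

if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = DummyNet().to(device)
    leaves = [torch.randn(16) for _ in range(64)]  # 64 pending leaf evaluations
    values, logits = evaluate_leaves_batched(model, leaves, device)
    print(values.shape, logits.shape)  # torch.Size([64, 1]) torch.Size([64, 4])
```

The hard part, of course, is getting MCTS to produce many pending leaves at once, which is where virtual losses or running many games in parallel come in.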

Here is another article which describes this process in slightly more detail if it helps, under the section titled "MCTS with Virtual Losses": https://www.moderndescartes.com/essays/agz/

jonathan-laurent commented 3 years ago

Author of AlphaZero.jl here.

As this issue correctly hints at, a limitation of the current implementation is that the neural network is evaluated on one state at a time during self-play. On such a small workload, the GPU latency alone can exceed the CPU inference cost.

In order to leverage a GPU during self-play, you need to evaluate the network on big batches of states.
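To give a rough sense of the magnitude, here is a tiny toy benchmark (illustrative only, not tied to this repo's networks) comparing the per-state cost of GPU inference at batch size 1 versus batch size 128; the batched version is typically cheaper per state by one to two orders of magnitude:

```python
import time
import torch
import torch.nn as nn

# Toy network standing in for the real model; the point is only to show how
# per-state cost changes with batch size, not to reproduce MuZero's networks.
device = "cuda" if torch.cuda.is_available() else "cpu"
net = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 256)).to(device)

def time_inference(batch_size, iters=200):
    x = torch.randn(batch_size, 128, device=device)
    with torch.no_grad():
        # Warm-up so initialization and kernel launch overheads are not measured.
        for _ in range(10):
            net(x)
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            net(x)
        if device == "cuda":
            torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    return elapsed / (iters * batch_size)  # seconds per state

print("batch 1  :", time_inference(1), "s/state")
print("batch 128:", time_inference(128), "s/state")
```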

One possible way to achieve this is indeed to use an asynchronous MCTS implementation, where many workers explore the same tree simultaneously and send inference requests to a shared server. A virtual loss can then be introduced to ensure that the workers do not all explore the same branch. However, this comes at a cost, as the virtual loss can introduce a significant exploration bias, especially when the number of MCTS simulations is not very high, leading to diminished search performance (LC0 ran some nice experiments on this).
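For reference, the virtual-loss bookkeeping itself is conceptually simple. Here is a minimal sketch (the class and field names are mine, not the ones used in this repo), assuming values in [-1, 1] so that each virtual loss counts as a temporarily lost playout:

```python
import math

class Node:
    def __init__(self, prior):
        self.prior = prior
        self.visit_count = 0
        self.value_sum = 0.0
        self.virtual_loss = 0
        self.children = {}  # action -> Node

    def q(self):
        n = self.visit_count + self.virtual_loss
        if n == 0:
            return 0.0
        # Each pending virtual loss is counted as a lost playout, which
        # discourages concurrent workers from descending the same branch.
        return (self.value_sum - self.virtual_loss) / n

def select_child(node, c_puct=1.25):
    total = sum(c.visit_count + c.virtual_loss for c in node.children.values())
    def score(child):
        u = c_puct * child.prior * math.sqrt(total + 1) / (1 + child.visit_count + child.virtual_loss)
        return child.q() + u
    return max(node.children.items(), key=lambda kv: score(kv[1]))

def apply_virtual_loss(path):
    # Called before sending the leaf to the inference server.
    for node in path:
        node.virtual_loss += 1

def revert_virtual_loss(path, value):
    # Once the real evaluation comes back, replace the virtual loss with
    # the actual result along the search path.
    for node in path:
        node.virtual_loss -= 1
        node.visit_count += 1
        node.value_sum += value
```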

Therefore, a better and easier solution is to have several asynchronous or parallel self-play workers and to batch inference requests across game simulations. For example, in the connect-four demo of AlphaZero.jl, we simulate 128 games in parallel and therefore do not need to resort to async MCTS to reach good GPU utilization. Async MCTS can still be used to complement this technique, and it is especially useful when the amount of available RAM does not allow you to maintain that many MCTS trees simultaneously.

Given that this implementation already uses ray, it should not be too hard to spawn several self-play workers by decoupling the network inference part and putting it into a separate actor. Doing so would result in dramatic performance gains on GPUs and allow more challenging environments to be solved.
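Sketched very roughly in Python (the actor and method names below are illustrative, not ones defined in this repo), the decoupling could look like:

```python
import ray
import torch
import torch.nn as nn

@ray.remote  # in a real setup you would place the actor on the GPU with @ray.remote(num_gpus=1)
class InferenceServer:
    """Owns the model and answers batched evaluation requests from self-play workers."""

    def __init__(self):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        # Toy network standing in for the real MuZero networks.
        self.model = nn.Sequential(
            nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 5)
        ).to(self.device)

    def evaluate(self, observations):
        # Observations gathered from one or more self-play workers are
        # evaluated in a single forward pass.
        batch = torch.stack(observations).to(self.device)
        with torch.no_grad():
            return self.model(batch).cpu()

if __name__ == "__main__":
    ray.init()
    server = InferenceServer.remote()
    # Two self-play workers each submit a handful of leaf observations.
    requests = [[torch.randn(16) for _ in range(8)] for _ in range(2)]
    results = ray.get([server.evaluate.remote(obs) for obs in requests])
    print([r.shape for r in results])  # [torch.Size([8, 5]), torch.Size([8, 5])]
    ray.shutdown()
```

A fuller version would queue incoming requests and only run the model once enough of them have accumulated (or a short timeout expires), so that requests from many self-play workers end up in the same batch.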

@werner-duvaud Thanks for your hard work on making MuZero more accessible! I would be happy to engage in further discussion and to share more of my experience working with AlphaZero if it can be helpful to you.