Closed: FXDevailly closed this issue 2 years ago
As mentioned in the MuZero paper, revisiting past time steps to re-execute the MCTS and update policy 'targets' (child_visits) can help improve sample efficiency. Is there a reason (other than computational cost) for not including this option in the current implementation?
FXDevailly, can you implement this, or explain to me how to do it? I'll look into it.
Thank you
Hi,
The main reason is indeed the computational cost on single CPU and GPU machines.
I haven't had much time lately, but it's something to add, yes. PRs are welcome :)
Thanks for the quick reply! I have submitted the corresponding PR :)
JohnPPP, I have submitted a PR which includes this :)
That is so great!
(How does this work? Does the PR need to be accepted before it becomes part of the code? Will we then have the option to use the revisiting?)
Edit: Never mind, I have found the PR! I'm new to all this. I need to compare the code of the 5 files to see how you did it!
Thank you :D
Great! Actually, the option to reanalyze root values (state values) is already included in the repo. I'm simply suggesting to 'reanalyze' policy targets as well. And yes, it will need to be verified of course. There might be errors/inefficiencies.
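To make the distinction concrete, here is a minimal, self-contained sketch of the idea. StoredGame, run_mcts and the buffer layout below are invented placeholders for illustration, not the repo's or the PR's actual code:

```python
import random
from dataclasses import dataclass, field


@dataclass
class StoredGame:
    """Placeholder for a game history sitting in the replay buffer."""
    observations: list
    child_visits: list = field(default_factory=list)  # per-step policy targets


def run_mcts(weights, observation, num_actions):
    """Stand-in for the real search: returns per-action visit counts
    obtained with the *current* network weights."""
    return [random.randint(0, 10) for _ in range(num_actions)]


def reanalyze_policy_targets(game, weights, num_actions=4):
    """Revisit every stored step, re-run MCTS with fresh weights, and
    overwrite the stored child_visits (the policy training targets)
    with the normalized visit counts of the new search."""
    for t, observation in enumerate(game.observations):
        visits = run_mcts(weights, observation, num_actions)
        total = sum(visits) or 1
        game.child_visits[t] = [v / total for v in visits]
        # (root values could be refreshed here as well; the repo already
        #  supports reanalyzing those, as mentioned above)


if __name__ == "__main__":
    game = StoredGame(observations=[None, None, None],
                      child_visits=[[0.25] * 4 for _ in range(3)])
    reanalyze_policy_targets(game, weights=None)
    print(game.child_visits)
```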
I gotta say FXDevailly, it's nuts that you guys share your code, opinions, and improvements so openly. Very rare!
When your PR gets accepted, will the "reanalyze" always be for both state values and policy? By that I mean, is there any way to switch back?
If I set self.use_updated_mcts_value_targets = False, then I'm "reanalyzing" both value and policy. If True, "just" value. Is this it?
Thank you, João
Big thanks to the previous contributors for their great work! I'm completely new here... The current PR always reanalyzes policy targets. You could easily make it optional with an additional setting in the game files.
No. It just defines how the updated value targets are computed. If False, it simply queries the representation + value networks. If True, it uses the values obtained from the re-execution of the MCTS.
In both cases, policy targets are updated according to the re-execution of the MCTS.
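To restate those semantics as a toy sketch (purely illustrative; only the flag name use_updated_mcts_value_targets comes from the thread, everything else here is made up):

```python
def build_targets(fresh_visit_counts, fresh_root_value, network_value,
                  use_updated_mcts_value_targets):
    """Illustrative only. Policy targets always come from the re-executed
    MCTS; the flag only selects where the updated value target comes from."""
    total = sum(fresh_visit_counts) or 1
    policy_target = [v / total for v in fresh_visit_counts]  # both modes

    if use_updated_mcts_value_targets:
        value_target = fresh_root_value   # True: root value of the new search
    else:
        value_target = network_value      # False: representation + value networks only

    return policy_target, value_target


# Visit counts [8, 1, 1] from the re-run search, MCTS root value 0.6,
# plain network value estimate 0.4:
print(build_targets([8, 1, 1], 0.6, 0.4, use_updated_mcts_value_targets=False))
# -> ([0.8, 0.1, 0.1], 0.4)
print(build_targets([8, 1, 1], 0.6, 0.4, use_updated_mcts_value_targets=True))
# -> ([0.8, 0.1, 0.1], 0.6)
```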
Thanks for the PR! I'm a bit busy and haven't had the time to review it yet.
But for those who want to use it directly (and have a really good CPU :) ), here's how:
git clone https://github.com/werner-duvaud/muzero-general.git
cd muzero-general/
git fetch origin pull/142/head
git checkout -b pullrequest FETCH_HEAD
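Once those commands finish, `git branch` should show the local `pullrequest` branch checked out; from there, training starts the same way as on the master branch (the repo's usual `python muzero.py` entry point), assuming the PR branch does not change it.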