werner-duvaud / muzero-general

MuZero
https://github.com/werner-duvaud/muzero-general/wiki/MuZero-Documentation
MIT License

Reason for not re-executing MCTS and updating policy "targets" (child visits) in "Reanalyze"? #138

Closed FXDevailly closed 2 years ago

FXDevailly commented 3 years ago

As mentioned in the MuZero paper, revisiting past time steps to re-execute the MCTS and update policy 'targets' (child_visits) can help improve sample efficiency. Is there a reason (other than computational cost) for not including this option in the current implementation?
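For context, the Reanalyze idea being discussed can be sketched as follows. This is only a minimal illustration under assumed names, not the repository's actual code: `run_mcts`, `latest_network`, and the dict-based `game_history` are hypothetical stand-ins for the real structures.

```python
# Hypothetical sketch of MuZero "Reanalyze": periodically revisit stored
# games, re-run MCTS with the latest network weights, and overwrite the
# stale policy targets (normalized child visit counts) with fresher ones.

def normalized_visits(visit_counts):
    """Turn raw root-child visit counts into a policy target distribution."""
    total = sum(visit_counts.values())
    return {action: count / total for action, count in visit_counts.items()}

def reanalyse_game(game_history, run_mcts, latest_network):
    """Recompute the policy target for every stored position in one game.

    `run_mcts` is assumed to return the root's child visit counts as
    {action: count}; `game_history["child_visits"]` mirrors the per-step
    policy targets MuZero stores alongside observations.
    """
    for i, obs in enumerate(game_history["observations"]):
        visit_counts = run_mcts(latest_network, obs)
        game_history["child_visits"][i] = normalized_visits(visit_counts)
    return game_history
```

A training loop would call something like `reanalyse_game` on games sampled from the replay buffer, so that old positions are learned from with up-to-date search statistics rather than the statistics recorded at self-play time.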

werner-duvaud commented 3 years ago

Hi,

The main reason is indeed the computational cost on single CPU and GPU machines.

I haven't had much time lately, but it's something to add, yes. PRs are welcome :)

JohnPPP commented 3 years ago

FXDevailly, can you implement this or explain to me how to do it? I'll take a look into it.

Thank you

FXDevailly commented 3 years ago

Thanks for the quick reply! I have submitted the corresponding PR :)

FXDevailly commented 3 years ago

JohnPPP, I have submitted a PR which includes this :)

JohnPPP commented 3 years ago

That is so great!

(How does this work? Does the PR need to be accepted before it's merged into the code? Will we then have the option to use the revisiting?)

Edit: Nevermind, I have found the PR! I'm new to all this. I need to compare the code of the 5 files to see how you did it!

Thank you :D

FXDevailly commented 3 years ago

Great! Actually, the option to reanalyze root values (state values) is already included in the repo. I'm simply suggesting to 'reanalyze' policy targets as well. And yes, it will need to be verified, of course; there might be errors/inefficiencies.

JohnPPP commented 3 years ago

I gotta say, FXDevailly, it's nuts that you guys share your code, opinions, and improvements so openly. Very rare!

When your PR gets accepted, the "reanalyze" will always be for states and policy? By that I mean is there any way to switch back?

If I set self.use_updated_mcts_value_targets = False, then I'm "reanalyzing" both value and policy. If True, "just" value. Is this it?

Thank you, João

FXDevailly commented 3 years ago

Big thanks to the previous contributors for their great work! I'm completely new here... The current PR always reanalyzes policy targets. You could easily make it optional with an additional setting in the game files.

No. It just defines how the updated value targets are computed. If False, it simply queries the representation + value networks. If True, it uses the values obtained via the re-execution of the MCTS.

In both cases, policy targets are updated according to the re-execution of the MCTS.
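Schematically, the two modes described above differ only in where the value target comes from; the policy target is rebuilt from the fresh MCTS visit counts either way. This is an illustrative sketch, not the PR's code: `run_mcts`, `representation_net`, and `value_net` are placeholder names.

```python
def compute_targets(obs, use_updated_mcts_value_targets,
                    run_mcts, representation_net, value_net):
    """Sketch of the two value-target modes discussed above.

    `run_mcts` is assumed to return (root_value, child_visit_counts).
    """
    root_value, visit_counts = run_mcts(obs)

    # Policy target: always normalized visit counts from the re-executed MCTS.
    total = sum(visit_counts.values())
    policy_target = {a: n / total for a, n in visit_counts.items()}

    if use_updated_mcts_value_targets:
        # True: value target taken from the re-executed MCTS root value.
        value_target = root_value
    else:
        # False: value target from a plain representation + value network pass.
        value_target = value_net(representation_net(obs))

    return value_target, policy_target
```

So the flag selects between an MCTS-derived value and a direct network estimate, while the policy reanalysis happens unconditionally in both branches.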

werner-duvaud commented 3 years ago

Thanks for the PR! I'm a bit busy, I haven't had the time to review yet.

But for those who want to use it directly (and have a really good CPU :) ), here's how:

git clone https://github.com/werner-duvaud/muzero-general.git
cd muzero-general/
git fetch origin pull/142/head
git checkout -b pullrequest FETCH_HEAD