ziofil closed this issue 3 years ago
Hi,
I don't think this part is detailed in the paper, and I'd be interested to know whether other papers do something similar.
We made the target policy uniform for the absorbing states to avoid biasing the model towards a particular distribution. We also experimented with all zeros, and it made no noticeable difference in performance, so I don't know which is best.
Shouldn't one use an artificial target "policy" made of all zeros here? Then the gradients would be zero, right? 🤔
https://github.com/werner-duvaud/muzero-general/blob/ed3fc8a4532bd4afe564c29c2374e27c0e17544e/replay_buffer.py#L272
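To make the gradient question concrete, here is a minimal sketch in plain Python. It assumes the policy loss is a softmax cross-entropy of the form `-sum(target * log_softmax(logits))`. The helper names `softmax` and `cross_entropy_grad` and the example logits are made up for illustration and are not taken from the repository.

```python
import math


def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_grad(logits, target):
    """Gradient of -sum_i target[i] * log(softmax(logits)[i]) w.r.t. logits.

    Analytically this is softmax(logits)[j] * sum(target) - target[j].
    """
    probs = softmax(logits)
    target_mass = sum(target)
    return [p * target_mass - t for p, t in zip(probs, target)]


# Hypothetical policy logits for an absorbing state with 3 actions.
logits = [2.0, 0.5, -1.0]

zero_target = [0.0, 0.0, 0.0]      # all-zeros "policy"
uniform_target = [1.0 / 3] * 3     # uniform policy

grad_zero = cross_entropy_grad(logits, zero_target)
grad_uniform = cross_entropy_grad(logits, uniform_target)

print("zero target grad:   ", grad_zero)
print("uniform target grad:", grad_uniform)
```

Under that assumption, the all-zeros target contributes no gradient to the policy logits at all, while the uniform target pulls the predicted policy towards uniform (the gradient is `softmax(logits) - 1/n`, which vanishes only when the prediction is already uniform).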