werner-duvaud / muzero-general


value/reward transform issue #6

Closed · xuxiyang1993 closed this issue 4 years ago

xuxiyang1993 commented 4 years ago

I think the way you transform the value/reward slightly mismatches the original paper, at this line (https://github.com/werner-duvaud/muzero-general/blob/fe791e8651645ea05f5b582157b4892588ee56ca/trainer.py#L153)

From the referenced paper (https://arxiv.org/abs/1805.11593), the transformation function should be

h(x) = sign(x) * (sqrt(|x| + 1) - 1) + eps * x,  with eps = 0.001

So instead of

x = torch.sign(x) * (torch.sqrt(torch.abs(x) + 1) - 1 + 0.001 * x)

the correct formula should be

x = torch.sign(x) * (torch.sqrt(torch.abs(x) + 1) - 1) + .001 * x
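
To illustrate (a quick PyTorch sketch of mine; the function names are made up, and eps = 0.001 as in the paper), the two versions only disagree for negative x, where the sign of the eps term flips:

```python
import torch

def transform_as_implemented(x, eps=0.001):
    # Current code: the eps * x term sits inside the sign() factor.
    return torch.sign(x) * (torch.sqrt(torch.abs(x) + 1) - 1 + eps * x)

def transform_lemma_a1(x, eps=0.001):
    # Lemma A1 of https://arxiv.org/abs/1805.11593:
    # h(x) = sign(x) * (sqrt(|x| + 1) - 1) + eps * x
    return torch.sign(x) * (torch.sqrt(torch.abs(x) + 1) - 1) + eps * x

x = torch.tensor([-5.0, 5.0])
# Positive inputs agree; negative inputs differ by 2 * eps * |x|.
print(transform_as_implemented(x) - transform_lemma_a1(x))  # tensor([0.0100, 0.0000])
```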

xuxiyang1993 commented 4 years ago

Also, at this line (https://github.com/werner-duvaud/muzero-general/blob/master/trainer.py#L157): if x has an integer value (which can happen when the value/reward falls outside the support_size range and gets clamped), floor and ceil are equal and prob = 0, so the same index of the final logits is set twice, and the second write (prob = 0) overwrites the correct value of 1.
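
One possible way to make that robust (just a sketch of mine, not the repository's code): accumulate with scatter_add_ instead of writing twice, so that when floor == ceil the two contributions still sum to 1:

```python
import torch

def scalar_to_support(x, support_size):
    # Illustrative sketch only: project each transformed scalar onto the
    # two adjacent integer bins of the support [-support_size, support_size].
    x = torch.clamp(x, -support_size, support_size)
    floor, ceil = x.floor(), x.ceil()
    prob = x - floor  # weight assigned to the upper bin
    logits = torch.zeros(*x.shape, 2 * support_size + 1)
    # scatter_add_ accumulates, so when floor == ceil (x is an integer)
    # the two contributions sum to (1 - prob) + prob = 1 instead of the
    # second write overwriting the first with prob = 0.
    logits.scatter_add_(-1, (floor + support_size).long().unsqueeze(-1),
                        (1 - prob).unsqueeze(-1))
    logits.scatter_add_(-1, (ceil + support_size).long().unsqueeze(-1),
                        prob.unsqueeze(-1))
    return logits
```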

werner-duvaud commented 4 years ago

You are right, thank you.

Indeed, we had implemented the formula as written in the MuZero paper (appendix: Network Architecture), but that formula differs from the one in the paper Observe and Look Further: Achieving Consistent Performance on Atari.

It was fixed in commit b1bd9ab.

xuxiyang1993 commented 4 years ago

I see. I think the formula in the MuZero paper is a typo introduced when they cited it; after reading the original paper, I believe the transformation and inverse transformation formulas from Lemma A1 & A2 are the correct ones.
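
For reference, a minimal sketch of the Lemma A2 inverse (assuming eps = 0.001; it follows from solving the quadratic in sqrt(|x| + 1)):

```python
import torch

def inverse_transform(x, eps=0.001):
    # Lemma A2 of https://arxiv.org/abs/1805.11593: the inverse of
    # h(x) = sign(x) * (sqrt(|x| + 1) - 1) + eps * x is
    # h^-1(x) = sign(x) * (((sqrt(1 + 4*eps*(|x| + 1 + eps)) - 1) / (2*eps))**2 - 1)
    return torch.sign(x) * (
        ((torch.sqrt(1 + 4 * eps * (torch.abs(x) + 1 + eps)) - 1) / (2 * eps)) ** 2 - 1
    )

x = torch.tensor([-5.0, 0.0, 5.0])
hx = torch.sign(x) * (torch.sqrt(torch.abs(x) + 1) - 1) + 0.001 * x
print(inverse_transform(hx))  # round trip recovers ≈ tensor([-5., 0., 5.])
```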

By the way, thanks for your implementation of the MuZero algorithm; it really helps me a lot!