Closed huajingyun closed 4 years ago
Hi,
About the first error: that is expected, sorry. The pretrained weights no longer correspond to the proposed hyperparameters. I have not had time to update them (I will try to do it in a few weeks).
About the low reward: first, the reward is divided by 3, so when you get 50 you are actually at 150, which is not so bad for lunarlander.
I solved it with a ratio of about 0.8. You can set your ratio to 0.8 to try it (but training will be slow).
When I solved it, the reward averaged 80 (i.e. 240) and sometimes reached 100 (i.e. 300). I couldn't get above 300, but I think that is enough for lunarlander.
Also, the proposed configuration may not be optimal. You can use the hyperparameter search option to find parameters that make lunarlander converge quickly.
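The reward scaling described in this reply can be sketched as a small conversion. This is only an illustration: the helper name `to_raw_reward` and the constant `REWARD_DIVISOR` are hypothetical and not taken from the repository. It assumes every step reward is divided by the same constant, so the logged total scales linearly.

```python
# Assumption: the game wrapper divides every step reward by 3,
# so a logged total reward is one third of the native total.
REWARD_DIVISOR = 3  # hypothetical name, value taken from the reply above


def to_raw_reward(logged_reward, divisor=REWARD_DIVISOR):
    """Convert a logged (scaled) total reward back to the native scale."""
    return logged_reward * divisor


# The three values quoted in the reply: 50, 80 and 100.
for logged in (50, 80, 100):
    print(logged, "->", to_raw_reward(logged))
```

So a logged total of 50 to 100 corresponds to 150 to 300 on the environment's own scale.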
After changing the ratio to 0.8 and changing `return reward/3` to `return reward`, the reward reached about 300 within 300k timesteps.
thx !!~ :)
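For readers following along, the `return reward/3` -> `return reward` change can be pictured with a minimal sketch like the one below. The `StubEnv` and `Game` classes and the `reward_divisor` parameter are hypothetical stand-ins, not the repository's actual code. The sketch only shows how the divisor affects the reward that the training loop sees.

```python
class StubEnv:
    """Hypothetical stand-in for the real environment; every step yields reward 3.0."""

    def step(self, action):
        observation, reward, done = [0.0], 3.0, False
        return observation, reward, done


class Game:
    """Hypothetical wrapper: reward_divisor=3 mimics `return reward/3`,
    reward_divisor=1 mimics `return reward`."""

    def __init__(self, env, reward_divisor=1):
        self.env = env
        self.reward_divisor = reward_divisor

    def step(self, action):
        observation, reward, done = self.env.step(action)
        return observation, reward / self.reward_divisor, done


scaled = Game(StubEnv(), reward_divisor=3).step(0)[1]  # reward as originally scaled
raw = Game(StubEnv(), reward_divisor=1).step(0)[1]     # reward after the change
print(scaled, raw)
```

With the divisor removed, the logged rewards are on the environment's native scale, so they are not directly comparable with runs logged before the change.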
When I load the pretrained model here, an error occurs.
So I trained the lunarlander model myself according to these steps, but I got a very low reward (the best total reward is about 40-50) with the default params after finishing the default 200000 training_steps.
According to the tensorboard log, the total reward increases unstably from about -100 to about 40-50 during 0-100k steps, and then decreases unstably to about -80 during 100k-200k steps.
I want to know:
How high a lunarlander reward can the muzero policy actually achieve?
How can I achieve a better reward (for example, higher than 200; can that be achieved)?