werner-duvaud / muzero-general

MuZero
https://github.com/werner-duvaud/muzero-general/wiki/MuZero-Documentation
MIT License

lunarlander with low reward #69

Closed: huajingyun closed this issue 4 years ago

huajingyun commented 4 years ago

When I loaded the pretrained model from here, an error occurred.

So I trained the lunarlander model myself according to these steps, but I got a very low reward (the best total reward is about 40-50) with the default params after finishing the default 200000 training_steps.

According to the TensorBoard log, the total reward increases unsteadily from about -100 to about 40-50 during steps 0-100k, and then decreases unsteadily to about -80 during steps 100k-200k.

I want to know: how high a lunarlander reward can the MuZero policy actually achieve, and how can I get a better reward (for example, higher than 200; is that achievable)?

werner-duvaud commented 4 years ago

Hi,

About the first error: it's expected, sorry, the pretrained weights no longer correspond to the proposed hyperparameters. I have not had time to update them (I will try to do it in a few weeks).

About the low reward: first, the reward is divided by 3, so when you get 50 you are actually at 150, which is not so bad for lunarlander.
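For context, that scaling happens in the LunarLander game wrapper; here is a rough, simplified sketch of what the `step` method in `games/lunarlander.py` does (not the exact source, observation shaping and other details omitted):

```python
import gym


class Game:
    """Simplified sketch of the LunarLander wrapper used by muzero-general."""

    def __init__(self, seed=None):
        self.env = gym.make("LunarLander-v2")
        if seed is not None:
            self.env.seed(seed)

    def step(self, action):
        observation, reward, done, _ = self.env.step(action)
        # The reward MuZero trains on (and the one shown in TensorBoard) is
        # divided by 3, so a logged total of 50 corresponds to roughly 150
        # on the native LunarLander scale.
        return observation, reward / 3, done
```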

I solved it with a ratio of about 0.8; you can try setting your ratio to 0.8 (but it will be slow to train).
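For reference, `ratio` is an attribute of the LunarLander `MuZeroConfig`; a minimal sketch of the change is below (check the comment next to `ratio` in your version of `games/lunarlander.py` for its exact definition):

```python
class MuZeroConfig:
    """Sketch of the relevant part of the LunarLander config."""

    def __init__(self):
        # Ratio between training steps and self-played steps (see the comment
        # next to `ratio` in the shipped config for the exact definition).
        # Setting it to about 0.8 was reported to help LunarLander converge,
        # at the cost of slower training.
        self.ratio = 0.8
```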

When I solved it, it averaged about 80 reward (i.e. 240) and sometimes reached 100 (i.e. 300). I couldn't go above 300, but I think that's enough for lunarlander.

Also, the proposed configuration may not be optimal. You can use the hyperparameter search option to find parameters that make lunarlander converge quickly.

huajingyun commented 4 years ago

After changing the ratio to 0.8 and `return reward/3` -> `return reward`, the reward reached about 300 within 300k training steps. Thanks!! :)
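For anyone else hitting this, the reward change amounts to returning the raw environment reward from the wrapper's `step`; a simplified sketch (mirroring the one earlier in the thread, not the exact file):

```python
import gym


class Game:
    """Sketch of the edited LunarLander wrapper: raw reward, no /3 scaling."""

    def __init__(self):
        self.env = gym.make("LunarLander-v2")

    def step(self, action):
        observation, reward, done, _ = self.env.step(action)
        # Return the raw LunarLander reward, so a logged total of ~300 is
        # already on the native scale (the usual "solved" threshold is 200).
        return observation, reward, done
```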