Implementation of reinforcement learning algorithms for the OpenAI Gym environment LunarLander-v2
gym==0.21.0
imageio==2.13.5
matplotlib==3.5.1
numpy==1.22.0
Pillow==9.0.1
torch==1.10.1+cu102
tqdm==4.62.3
Training instructions are included in the Jupyter notebook.
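For reference, a minimal sketch of interacting with the environment outside the notebook (this assumes the gym==0.21.0 API, where `reset()` returns only the observation and `step()` returns a 4-tuple; the random policy is just a placeholder for the trained agent):

```python
import gym

# Create the environment (gym 0.21.0 API).
env = gym.make("LunarLander-v2")

obs = env.reset()
total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()  # placeholder for the trained agent's action
    obs, reward, done, info = env.step(action)
    total_reward += reward
print(f"Episode finished with total reward {total_reward:.1f}")
env.close()
```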
I implemented an early-stop function which, when the average of the last 100 scores reaches a target value, stops the training process and plots the result. I find that a target score of 250 usually produces more consistent landings and higher total rewards.
Setting the target score to a lower value such as 200 results in more missed landings in the final demo. Setting the target score too high, however, can mean the average score never reaches the target value, which takes more time to train and does not necessarily produce a better result.
| Target=200, gamma=0.99 | Target=250, gamma=0.99 |
| --- | --- |
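A minimal sketch of how such an early-stop check can be written (the names `scores`, `recent`, and `target_score` are illustrative, not necessarily those used in the notebook):

```python
from collections import deque

import numpy as np
import matplotlib.pyplot as plt

target_score = 250          # stop when the 100-episode average reaches this value
scores = []                 # full training history, used for the final plot
recent = deque(maxlen=100)  # rolling window of the last 100 episode scores

def record_and_check(episode_score):
    """Record one episode score and return True if training should stop early."""
    scores.append(episode_score)
    recent.append(episode_score)
    return len(recent) == 100 and np.mean(recent) >= target_score

def plot_scores():
    """Plot per-episode scores together with the 100-episode moving average."""
    plt.plot(scores, label="episode score")
    plt.plot(np.convolve(scores, np.ones(100) / 100, mode="valid"),
             label="100-episode average")
    plt.xlabel("Episode")
    plt.ylabel("Score")
    plt.legend()
    plt.show()
```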
The discount factor gamma determines the importance of future rewards; its value should lie between 0 and 1.
Setting it too low makes the agent "short-sighted", considering only the rewards nearest to its current state. Setting it higher makes it weigh long-term rewards more heavily.
If the discount factor is set equal to or greater than 1, the discounted return and the value estimates may diverge.
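To illustrate, here is a small numerical sketch of how the discounted return behaves for different values of gamma (a long episode with a constant reward of 1 per step is an assumption made purely for illustration):

```python
import numpy as np

def discounted_return(gamma, rewards):
    """Sum of gamma**t * r_t over one episode."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

rewards = np.ones(1000)  # illustrative episode: constant reward of 1 per step

for gamma in (0.9, 0.99, 1.0, 1.3):
    print(f"gamma={gamma}: return = {discounted_return(gamma, rewards):.3g}")

# gamma=0.9  -> ~10    (bounded by 1 / (1 - gamma); heavily discounts the future)
# gamma=0.99 -> ~100   (still bounded, but weighs long-term rewards more)
# gamma=1.0  -> 1000   (grows with episode length; unbounded in the limit)
# gamma=1.3  -> ~3e114 (explodes, so the value estimates diverge)
```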
The experiments are shown below:
| | Target=230, gamma=0.9 | Target=230, gamma=1.3 |
| --- | --- | --- |
| Training Time | 2:02:58, 5000 episodes | 12:09, 5000 episodes |
| Training curve | | |
| Result | | |
We can see that setting gamma to a value that is too low (0.9) results in slow training and no convergence, and causes the ship to keep hovering without landing.
Setting gamma to a value greater than 1 (1.3) also fails to converge: the ship fired only one side engine and flew out of control.