yandexdataschool / AgentNet

Deep Reinforcement Learning library for humans
http://agentnet.rtfd.org/

Reinforcement Learning Comparison #28

Closed: justheuristic closed this issue 8 years ago

justheuristic commented 8 years ago

Compare the existing (and k-step, once implemented) reinforcement learning algorithms and their mixtures
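
For reference, here is a minimal sketch of what a k-step target looks like; the function name and array layout are illustrative only, not AgentNet's API:

```python
import numpy as np

def k_step_targets(rewards, state_values, k, gamma=0.99):
    """Compute k-step TD targets for a single session.

    rewards      : array of shape [T]   - reward at each step
    state_values : array of shape [T+1] - value estimates V(s_t), with a
                   terminal bootstrap value (0 for terminal states) appended
    k            : number of real rewards to accumulate before bootstrapping
    """
    T = len(rewards)
    targets = np.zeros(T)
    for t in range(T):
        end = min(t + k, T)
        # discounted sum of up to k real rewards...
        ret = sum(gamma ** (i - t) * rewards[i] for i in range(t, end))
        # ...plus a bootstrapped value estimate for the rest of the session
        ret += gamma ** (end - t) * state_values[end]
        targets[t] = ret
    return targets
```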

justheuristic commented 8 years ago

Also find out how to get maximum effect out of session pool.

justheuristic commented 8 years ago

Results: https://github.com/justheuristic/AgentNet/tree/master/research/rl_methods_comparison

Brief analysis: Q-learning follows a somewhat jittery learning curve, but learns fastest of all. It is quick to reach the plateau (average reward of roughly 5). Its main drawback is that average reward tends to drop occasionally after the plateau is reached. This probably happens when the agent over-trains on optimal behaviour and forgets what to do after a random suboptimal action (e.g. one caused by the epsilon-greedy policy).
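
This is consistent with Q-learning being off-policy: its target only ever looks at the greedy next action, so the value of recovering from an epsilon-greedy blunder is never reinforced. A tabular sketch for illustration (not AgentNet code):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Tabular Q-learning on a 2-D array Q[state, action]:
    bootstrap from the *greedy* next action, regardless of which action
    the (epsilon-greedy) behaviour policy actually takes next."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```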

Naive and SARSA both lack this problem, probably because they are on-policy algorithms, but they converge more slowly, especially when getting close to the plateau (about +3 average reward vs. the +5 plateau).
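
By contrast, SARSA bootstraps from the action the behaviour policy actually took, so occasional epsilon-greedy mistakes are baked into the value estimates. Again a tabular sketch for illustration, not library code:

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """Tabular SARSA on a 2-D array Q[state, action]:
    bootstrap from Q(s', a') where a' is the action actually chosen
    by the (epsilon-greedy) policy in s'."""
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```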

They (naive and SARSA) also tend to stick to suboptimal policy regions for longer before moving past them. This manifests itself as several flat regions at reward = 0 (ending the session on the first turn) for both, and at reward = 2 for naive (some unexplored trivial strategy).

The mixture of naive and Q-learning (with equal weights) is both fast to reach the plateau and free of Q-learning's downfalls. However, it is slightly slower than Q-learning (perhaps due to sheer randomness) and still sits in the reward = 0 suboptimal region for some time.
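
One way to read "mixture with equal weights" is simply averaging the two per-step objectives; the exact weighting scheme used in the experiment may differ, so treat this as an illustration only:

```python
def mixed_loss(q_learning_loss, naive_loss, q_weight=0.5):
    """Illustrative equal-weight mixture of the two objectives.
    Both inputs are per-step loss arrays (or symbolic tensors) of the same shape."""
    return q_weight * q_learning_loss + (1.0 - q_weight) * naive_loss
```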

The session pool in its current state works too slowly: the pool stores several high-reward lucky sessions, so the agent is trained on this "overly optimistic world" and pays little attention to optimizing the actual expected reward. Such a model converges too slowly and does not reach the plateau even after a million iterations. The next step is to use the session pool as a means of after-training.
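
The bias described above can be sketched like this; the class and method names are hypothetical and do not reflect AgentNet's actual session-pool API:

```python
import random

class TopRewardSessionPool:
    """Keeps only the highest-reward sessions seen so far; batches drawn from
    such a pool no longer reflect the agent's actual expected reward, which is
    the 'overly optimistic world' problem described above."""

    def __init__(self, max_size=100):
        self.max_size = max_size
        self.pool = []  # list of (total_reward, session) pairs

    def add(self, session, total_reward):
        self.pool.append((total_reward, session))
        # keep only the max_size best sessions by total reward
        self.pool.sort(key=lambda item: item[0], reverse=True)
        del self.pool[self.max_size:]

    def sample_batch(self, n):
        chosen = random.sample(self.pool, min(n, len(self.pool)))
        return [session for _, session in chosen]
```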