In the same spirit as the previous post, I implemented a new reinforcement learning algorithm as I was following David Silver's RL lectures.

Jupyter notebook here

Specifically, I managed to replicate the optimal Blackjack policy from Sutton's book:

![image](https://user-images.githubusercontent.com/1907986/46922475-8f265980-d001-11e8-8424-76ba2534836b.png)

My result after 1e7 simulations (the x-axis is offset by 0.5):

![image](https://user-images.githubusercontent.com/1907986/46922477-a06f6600-d001-11e8-8d95-862fc2734d56.png)
Monte Carlo control means (there is a minimal sketch right after this list):

- Simulate lots of episodes (each episode is a list of state-action pairs, `[S0, A0, S1, A1, ...]`).
- Calculate the average accumulated reward for each state-action pair seen, either first-visit or every-visit; use this as an estimate for `q(state, action)`.
- Update the policy to be greedy according to the updated `q` values, `pi(state) = argmax(q)`.
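Here is a minimal sketch of that update, assuming undiscounted Blackjack where the only reward arrives at the end of the episode; the names (`ACTIONS`, `update_from_episode`) and the data layout are my own stand-ins, not the notebook's actual code.

```python
from collections import defaultdict

ACTIONS = ["hit", "stick"]

q = defaultdict(float)     # (state, action) -> estimated return
counts = defaultdict(int)  # (state, action) -> number of first visits
policy = {}                # state -> current greedy action

def update_from_episode(episode, reward):
    """First-visit Monte Carlo update.

    episode: [(S0, A0), (S1, A1), ...]; reward: the episode's terminal return.
    """
    seen = set()
    for state, action in episode:
        if (state, action) in seen:
            continue  # first-visit: count each pair at most once per episode
        seen.add((state, action))
        counts[(state, action)] += 1
        # incremental running average of the returns for this pair
        q[(state, action)] += (reward - q[(state, action)]) / counts[(state, action)]
        # policy improvement: act greedily w.r.t. the updated q values
        policy[state] = max(ACTIONS, key=lambda a: q[(state, a)])
```

Feeding this with episodes from a Blackjack simulator and repeating many times is essentially the whole algorithm.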
Exploring Starts ("ES") requires that during simulation, every state-action pair must have a non-zero.
At first I ignored ES in my code because I thought the simulation already provided it. I quickly noticed my mistake when I saw a very imbalanced action distribution caused by greedy policy iteration: after only one episode, the q value would already differ between the actions of a state, and that would steer all future decisions in that state towards a single action. This was fixed by requiring the policy to return a random first action for each episode (search for "exploring starts" in the notebook).
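A sketch of that fix, reusing the hypothetical `policy` dict and action list from above:

```python
import random

def choose_action(state, step, policy, actions=("hit", "stick")):
    """Return a uniformly random action on the first step of every episode
    (exploring starts), then follow the current greedy policy."""
    if step == 0:
        return random.choice(actions)
    return policy.get(state, random.choice(actions))
```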
I also kinda forced myself to write more decent Python code this time, and I learned a few things (a combined example follows the list):

- Python 3.7 introduced dataclasses, which are very similar to Scala's case classes, and I like them a lot.
- Python now has type annotations; for example, you can write things like `def f(x: int) -> int`.
- Abstract classes via `ABC`. I knew about this before but had never thought about using it.
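A combined illustration of all three (the `State` and `Policy` names here are hypothetical, not the notebook's actual classes):

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass(frozen=True)
class State:
    """Immutable, hashable Blackjack state, usable directly as a dict key."""
    player_sum: int
    dealer_card: int
    usable_ace: bool

class Policy(ABC):
    """Abstract base class: concrete policies must implement `action`."""

    @abstractmethod
    def action(self, state: State) -> str:
        ...

class AlwaysStick(Policy):
    def action(self, state: State) -> str:
        return "stick"
```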
Also, this took longer than I expected (roughly 5 hours), despite the fact that it is a really simple algorithm. I have read this post about how difficult it generally is to replicate reinforcement learning algorithms, and my biggest takeaway from this exercise is:
Test along the way; test every single smallest step you can imagine.
For a problem of such an iterative nature, it's really helpful to step through a few episodes and understand what the updates are doing. I spotted a few very subtle bugs that way.
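For example, a tiny sanity check on the `update_from_episode` sketch above: two episodes visiting the same pair with returns +1 and -1 should average out to 0.

```python
def test_first_visit_average():
    q.clear(); counts.clear(); policy.clear()
    episode = [(("some-state", 12), "hit")]
    update_from_episode(episode, +1.0)
    update_from_episode(episode, -1.0)
    # running average of +1 and -1 should be exactly 0
    assert q[(("some-state", 12), "hit")] == 0.0

test_first_visit_average()
```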
Onwards!