In the same spirit as the previous post, I implemented a new reinforcement learning algorithm as I was following David Silver's RL lectures.

Jupyter notebook here

Specifically, I managed to replicate the optimal Blackjack policy from Sutton's book:

![image](https://user-images.githubusercontent.com/1907986/46922475-8f265980-d001-11e8-8424-76ba2534836b.png)

My result after 1e7 simulations (the x-axis is offset by 0.5):

![image](https://user-images.githubusercontent.com/1907986/46922477-a06f6600-d001-11e8-8d95-862fc2734d56.png)
Monte Carlo control means (there is a minimal sketch right after this list):

- Simulate lots of episodes (each episode is a list of state-action pairs, `[S0, A0, S1, A1, ...]`).
- Calculate the average accumulated reward for each state-action pair seen, either first-visit or every-visit; use this as an estimate for `q(state, action)`.
- Update the policy to be greedy according to the updated `q` values, `pi(state) = argmax(q)`.
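Here is a minimal sketch of that update, assuming undiscounted Blackjack where the only reward arrives at the end of the episode; the names (`ACTIONS`, `update_from_episode`) and the data layout are my own stand-ins, not the notebook's actual code.

```python
from collections import defaultdict

ACTIONS = ["hit", "stick"]

q = defaultdict(float)     # (state, action) -> estimated return
counts = defaultdict(int)  # (state, action) -> number of first visits
policy = {}                # state -> current greedy action

def update_from_episode(episode, reward):
    """First-visit Monte Carlo update.

    episode: [(S0, A0), (S1, A1), ...]; reward: the episode's terminal return.
    """
    seen = set()
    for state, action in episode:
        if (state, action) in seen:
            continue  # first-visit: count each pair at most once per episode
        seen.add((state, action))
        counts[(state, action)] += 1
        # incremental running average of the returns for this pair
        q[(state, action)] += (reward - q[(state, action)]) / counts[(state, action)]
        # policy improvement: act greedily w.r.t. the updated q values
        policy[state] = max(ACTIONS, key=lambda a: q[(state, a)])
```

Feeding this with episodes from a Blackjack simulator and repeating many times is essentially the whole algorithm.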
Exploring Starts ("ES") requires that during simulation, every state-action pair must have a non-zero.
At first I ignored ES in my code because I thought the simulation already provided it. I quickly noticed my mistake when I saw a very imbalanced action distribution caused by greedy policy iteration: after only one episode, the q value would already differ between the actions of a state, and that would steer all future decisions in that state towards a single action. This was fixed by requiring the policy to return a random first action for each episode (search for "exploring starts" in the notebook).
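A sketch of that fix, reusing the hypothetical `policy` dict and action list from above:

```python
import random

def choose_action(state, step, policy, actions=("hit", "stick")):
    """Return a uniformly random action on the first step of every episode
    (exploring starts), then follow the current greedy policy."""
    if step == 0:
        return random.choice(actions)
    return policy.get(state, random.choice(actions))
```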
I also kinda forced myself to write more decent Python code this time, and I learned a few things (a combined example follows the list):

- Python 3.7 introduced dataclasses, which are very similar to Scala's case classes, and I like them a lot.
- Python now has type annotations; for example, you can write things like `def f(x: int) -> int`.
- Abstract classes via `ABC`. I knew about this before but had never thought about using it.
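A combined illustration of all three (the `State` and `Policy` names here are hypothetical, not the notebook's actual classes):

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass(frozen=True)
class State:
    """Immutable, hashable Blackjack state, usable directly as a dict key."""
    player_sum: int
    dealer_card: int
    usable_ace: bool

class Policy(ABC):
    """Abstract base class: concrete policies must implement `action`."""

    @abstractmethod
    def action(self, state: State) -> str:
        ...

class AlwaysStick(Policy):
    def action(self, state: State) -> str:
        return "stick"
```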
Also, this took longer than I expected (roughly 5 hours), despite the fact that it is a really simple algorithm. I have read this post about how difficult it generally is to replicate reinforcement learning algorithms, and my biggest takeaway from this exercise is:
Test along the way; test every single smallest step you can imagine.
For a problem of such an iterative nature, it's really helpful to step through a few episodes and understand what the updates are doing. I spotted a few very subtle bugs that way.
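For example, a tiny sanity check on the `update_from_episode` sketch above: two episodes visiting the same pair with returns +1 and -1 should average out to 0.

```python
def test_first_visit_average():
    q.clear(); counts.clear(); policy.clear()
    episode = [(("some-state", 12), "hit")]
    update_from_episode(episode, +1.0)
    update_from_episode(episode, -1.0)
    # running average of +1 and -1 should be exactly 0
    assert q[(("some-state", 12), "hit")] == 0.0

test_first_visit_average()
```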
Onwards!