Monte Carlo control with exploring starts #4

In the same spirit as the previous post, I implemented another reinforcement learning algorithm while following David Silver's RL lectures.

Jupyter notebook here

Specifically, I managed to replicate the optimal Blackjack policy from Sutton's book: [image]

My result after 1e7 simulations (the x-axis is offset by 0.5): [image]

Monte Carlo control works as follows:

  1. Simulate lots of episodes (each episode is a list of state-action pairs, [S0, A0, S1, A1, ...])
  2. Calculate the average accumulated reward for each state-action pair seen, either by first-visit or every-visit; use this as an estimate for q(state, action)
  3. Update the policy to be greedy with respect to the updated q values, pi(state) = argmax_a q(state, a) (see the sketch below).
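
For concreteness, here is a minimal sketch of these three steps on a simplified Blackjack (infinite deck, dealer hits until 17). This is my own illustration, not the code from the notebook; the function names (`play_episode`, `mc_control`) are made up for this sketch, and exploring starts is deliberately left out here, since that is exactly the issue discussed next.

```python
import random
from collections import defaultdict

HIT, STICK = 1, 0

def draw_card():
    return min(random.randint(1, 13), 10)  # 1 = ace, face cards count as 10

def add_card(total, usable_ace, card):
    """Add a card to a hand, tracking whether an ace is currently counted as 11."""
    if card == 1 and total + 11 <= 21:
        return total + 11, True
    total += card
    if total > 21 and usable_ace:
        return total - 10, False              # demote the usable ace to 1
    return total, usable_ace

def play_episode(policy):
    """Step 1: simulate one episode, returning ([(state, action), ...], final reward)."""
    player, usable = 0, False
    while player < 12:                        # always hit below 12, as in Sutton's setup
        player, usable = add_card(player, usable, draw_card())
    dealer_showing = draw_card()
    episode = []
    while True:
        state = (player, dealer_showing, usable)
        action = policy.get(state, HIT if player < 20 else STICK)  # initial policy: stick on 20/21
        episode.append((state, action))
        if action == STICK:
            break
        player, usable = add_card(player, usable, draw_card())
        if player > 21:
            return episode, -1                # player busts
    dealer, d_usable = add_card(0, False, dealer_showing)
    while dealer < 17:                        # dealer hits until 17
        dealer, d_usable = add_card(dealer, d_usable, draw_card())
    if dealer > 21 or player > dealer:
        return episode, 1
    return episode, -1 if dealer > player else 0

def mc_control(n_episodes=500_000):           # the post uses 1e7 episodes
    q_sum, q_count = defaultdict(float), defaultdict(int)
    policy = {}
    for _ in range(n_episodes):
        episode, reward = play_episode(policy)
        seen = set()
        for state, action in episode:         # Step 2: first-visit averaging of returns
            if (state, action) in seen:
                continue
            seen.add((state, action))
            q_sum[(state, action)] += reward
            q_count[(state, action)] += 1
        for state, _ in episode:              # Step 3: greedy policy improvement
            policy[state] = max(
                (HIT, STICK),
                key=lambda a: q_sum[(state, a)] / max(q_count[(state, a)], 1),
            )
    return policy

if __name__ == "__main__":
    policy = mc_control()
    print(sum(a == STICK for a in policy.values()), "states where the learned policy sticks")
```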

Exploring Starts ("ES") requires that during simulation, every state-action pair has a non-zero probability of being selected at the start of an episode.

At first I ignored ES in my code because I thought it was already provided by the simulation. I quickly noticed my mistake when I saw a very imbalanced action distribution caused by greedy policy iteration: after only one episode, the q values already differ between the actions in a state, so the greedy update locks all future decisions in that state onto a single action. This was fixed by requiring the policy to return a random first action for each episode (search for "exploring starts" in the notebook).
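
A minimal sketch of that fix, reusing the `play_episode` / `mc_control` sketch above (`ExploringStartPolicy` is my own name for the wrapper, not something from the notebook): the only change is that the first action lookup of each episode returns a random action instead of the greedy one.

```python
import random

class ExploringStartPolicy:
    """Wrap the greedy policy dict so the first lookup of an episode is random."""
    def __init__(self, greedy_policy):
        self.greedy_policy = greedy_policy
        self.first_lookup = True

    def get(self, state, default):
        if self.first_lookup:                 # exploring start: random first action
            self.first_lookup = False
            return random.choice((HIT, STICK))
        return self.greedy_policy.get(state, default)

# In mc_control, wrap the policy with a fresh instance for every episode:
#     episode, reward = play_episode(ExploringStartPolicy(policy))
# so only the very first decision of each episode is randomised.
```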

I also kinda forced myself to write more decent Python code this time, and I learned a few things:

Also, this took longer than I expected (roughly 5 hours), despite it being a really simple algorithm. I have read this post about how difficult it generally is to replicate reinforcement learning algorithms, and my biggest takeaway from this exercise is:

Onwards!