I have been following David Silver's reinforcement learning lecture series (slides, videos).
In lecture 3 he talked about 3 things:
Policy evaluation: this evaluates a given policy, i.e. computes the value of every state under that policy (by iterating the Bellman expectation equation)
Policy iteration: this improves a policy by first taking the values from policy evaluation, then acting greedily with respect to them (always picking the action that brings the most incremental value); repeat until the policy stops changing and you have the optimal policy. A sketch of these first two procedures follows this list.
Value iteration: this is another route to the optimal policy; intuitively you start from the terminal states and gradually "fill out" the state space, backing values up with the Bellman optimality equation (as opposed to the Bellman expectation equation used in policy evaluation). This one is sketched on a grid world a little further down.
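Since the first two procedures are short, here is a minimal sketch of them in Python. All the scaffolding is mine, not the lecture's: I assume the MDP arrives as a table `P[s][a]` of `(probability, next_state, reward)` triples, and the function names, `gamma`, and the `theta` convergence threshold are my own choices.

```python
import numpy as np

def policy_evaluation(P, policy, gamma=1.0, theta=1e-8):
    """Sweep the Bellman expectation backup until the values stop moving."""
    V = np.zeros(len(P))
    while True:
        delta = 0.0
        for s in range(len(P)):
            # Expected one-step return, averaged over the policy's actions
            v = sum(policy[s][a] * sum(p * (r + gamma * V[s2])
                                       for p, s2, r in P[s][a])
                    for a in P[s])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V

def greedy(P, V, gamma=1.0):
    """One-step lookahead: put all probability on the best-looking action."""
    policy = {}
    for s in range(len(P)):
        q = {a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
             for a in P[s]}
        best = max(q, key=q.get)
        policy[s] = {a: float(a == best) for a in P[s]}
    return policy

def policy_iteration(P, policy, gamma=1.0):
    """Alternate evaluation and greedy improvement until the policy is stable."""
    while True:
        V = policy_evaluation(P, policy, gamma)
        improved = greedy(P, V, gamma)
        if improved == policy:
            return policy, V
        policy = improved
```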
He used a simple grid world example to walk through all 3 concepts, and because I always feel super insecure, I decided to actually code it and see if the values match.
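And here is the third procedure, value iteration, run on the small grid world I believe the lecture borrows from Sutton & Barto: a 4x4 grid, terminal states in two opposite corners, reward -1 on every move, moves off the grid leave you in place, no discounting. The layout and all names are my assumptions, so treat this as a sketch rather than a reproduction of the notebook.

```python
import numpy as np

N = 4
TERMINALS = {(0, 0), (N - 1, N - 1)}
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(s, a):
    """Deterministic transition: clamp to the grid, reward -1 per move."""
    ns = (min(max(s[0] + a[0], 0), N - 1),
          min(max(s[1] + a[1], 0), N - 1))
    return ns, -1.0

V = {(i, j): 0.0 for i in range(N) for j in range(N)}
while True:
    delta = 0.0
    for s in V:
        if s in TERMINALS:
            continue  # terminal values stay at 0
        # Bellman optimality backup: max over actions, not an expectation
        v = max(r + V[ns] for ns, r in (step(s, a) for a in ACTIONS))
        delta = max(delta, abs(v - V[s]))
        V[s] = v
    if delta < 1e-8:
        break

print(np.array([[V[(i, j)] for j in range(N)] for i in range(N)]))
# Each value is minus the number of steps to the nearest terminal corner:
# [[ 0. -1. -2. -3.]
#  [-1. -2. -3. -2.]
#  [-2. -3. -2. -1.]
#  [-3. -2. -1.  0.]]
```

The only difference from the evaluation sketch above is the backup: an expectation over the policy's actions there, a max over actions here.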
They do. Duh.
My Jupyter notebook can be found here.