I have been following David Silver's reinforcement learning lecture series (slides, videos).
In lecture 3 he talked about 3 things:
Policy evaluation: this evaluates a given policy, i.e. computes the value of every state under that policy (by iterating the Bellman expectation equation)
Policy iteration: this improves a policy by first taking the values from policy evaluation, then acting greedily with respect to them (always picking the action that brings the most incremental value); repeat until the policy stops changing and you have the optimal policy. A sketch of these first two procedures follows this list.
Value iteration: this is another route to the optimal policy; intuitively you start from the terminal states and gradually "fill out" the state space, backing values up with the Bellman optimality equation (as opposed to the Bellman expectation equation used in policy evaluation). This one is sketched on a grid world a little further down.
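Since the first two procedures are short, here is a minimal sketch of them in Python. All the scaffolding is mine, not the lecture's: I assume the MDP arrives as a table `P[s][a]` of `(probability, next_state, reward)` triples, and the function names, `gamma`, and the `theta` convergence threshold are my own choices.

```python
import numpy as np

def policy_evaluation(P, policy, gamma=1.0, theta=1e-8):
    """Sweep the Bellman expectation backup until the values stop moving."""
    V = np.zeros(len(P))
    while True:
        delta = 0.0
        for s in range(len(P)):
            # Expected one-step return, averaged over the policy's actions
            v = sum(policy[s][a] * sum(p * (r + gamma * V[s2])
                                       for p, s2, r in P[s][a])
                    for a in P[s])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V

def greedy(P, V, gamma=1.0):
    """One-step lookahead: put all probability on the best-looking action."""
    policy = {}
    for s in range(len(P)):
        q = {a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
             for a in P[s]}
        best = max(q, key=q.get)
        policy[s] = {a: float(a == best) for a in P[s]}
    return policy

def policy_iteration(P, policy, gamma=1.0):
    """Alternate evaluation and greedy improvement until the policy is stable."""
    while True:
        V = policy_evaluation(P, policy, gamma)
        improved = greedy(P, V, gamma)
        if improved == policy:
            return policy, V
        policy = improved
```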
He used a simple grid world example to walk through all 3 concepts, and because I always feel super insecure, I decided to actually code it and see if the values match.
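And here is the third procedure, value iteration, run on the small grid world I believe the lecture borrows from Sutton & Barto: a 4x4 grid, terminal states in two opposite corners, reward -1 on every move, moves off the grid leave you in place, no discounting. The layout and all names are my assumptions, so treat this as a sketch rather than a reproduction of the notebook.

```python
import numpy as np

N = 4
TERMINALS = {(0, 0), (N - 1, N - 1)}
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(s, a):
    """Deterministic transition: clamp to the grid, reward -1 per move."""
    ns = (min(max(s[0] + a[0], 0), N - 1),
          min(max(s[1] + a[1], 0), N - 1))
    return ns, -1.0

V = {(i, j): 0.0 for i in range(N) for j in range(N)}
while True:
    delta = 0.0
    for s in V:
        if s in TERMINALS:
            continue  # terminal values stay at 0
        # Bellman optimality backup: max over actions, not an expectation
        v = max(r + V[ns] for ns, r in (step(s, a) for a in ACTIONS))
        delta = max(delta, abs(v - V[s]))
        V[s] = v
    if delta < 1e-8:
        break

print(np.array([[V[(i, j)] for j in range(N)] for i in range(N)]))
# Each value is minus the number of steps to the nearest terminal corner:
# [[ 0. -1. -2. -3.]
#  [-1. -2. -3. -2.]
#  [-2. -3. -2. -1.]
#  [-3. -2. -1.  0.]]
```

The only difference from the evaluation sketch above is the backup: an expectation over the policy's actions there, a max over actions here.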
They do. Duh.
My Jupyter notebook can be found here.