xysun opened 5 years ago
My summary notes on model free prediction and control.
Code:
Model free prediction and control

- "offline": `V(S)` or `Q(S,A)` is only updated after each episode (so all episodes must terminate); "online": update per time step

Monte Carlo methods

- update `Q(S,A)` for every `(S,A)` pair visited; set `pi(S)` to be `argmax(Q(S,A))`
- some `(S,A)` pairs may never be visited, so make sure every `(S,A)` has a nonzero probability of being selected at the start ("exploring starts"); see the sketch below
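As a rough illustration of the Monte Carlo bullets, here is a minimal every-visit MC control sketch in Python. The environment interface (`env.states`, `env.actions`, and `env.step(s, a)` returning `(s_next, r, done)`) is hypothetical, not a real library API:

```python
import random
from collections import defaultdict

def mc_control_es(env, n_episodes, gamma=1.0):
    Q = defaultdict(float)     # Q(S,A) estimates
    counts = defaultdict(int)  # visit counts, for incremental averaging
    pi = {}                    # greedy policy: pi(S) = argmax over A of Q(S,A)

    for _ in range(n_episodes):
        # exploring starts: every (S,A) has a nonzero probability
        # of being selected at the start of an episode
        s, a = random.choice(env.states), random.choice(env.actions)
        episode, done = [], False
        while not done:                        # all episodes must terminate
            s_next, r, done = env.step(s, a)   # hypothetical env API
            episode.append((s, a, r))
            s = s_next
            a = pi.get(s, random.choice(env.actions))

        # walk the episode backwards so G accumulates the discounted return,
        # updating Q(S,A) for every (S,A) pair visited
        G = 0.0
        for s, a, r in reversed(episode):
            G = r + gamma * G
            counts[(s, a)] += 1
            Q[(s, a)] += (G - Q[(s, a)]) / counts[(s, a)]  # alpha = 1/k average
            pi[s] = max(env.actions, key=lambda a2: Q[(s, a2)])
    return Q, pi
```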
Temporal Difference methods

- update rule: `Q(S,A) <- Q(S,A) + alpha * (R + gamma*Q(S',A') - Q(S,A))`, where `S'` is the next state and `A'` is chosen following the current policy (this is Sarsa; sketch below)
- `alpha` comes from the incremental implementation of averaging, where `alpha = 1/k`; effectively a learning rate
- "TD target" = `R + gamma*Q(S',A')`; "TD error" = "TD target" - `Q(S,A)`
- Q-learning: replace `Q(S',A')` with `max(Q(S',a))` over actions `a`
- off-policy: replace `Q(S',A')` with `Q(S',A_pi)`, where `A_pi` is chosen following the target policy (`A` is generated by the behaviour policy)
- double Q-learning: `A` is chosen from `Q1 + Q2` (average or sum); after observing `S,A,R',S'`, update `Q1(S,A) <- Q1(S,A) + alpha * (R' + gamma*Q2(S', argmax(Q1(S',.))) - Q1(S,A))`, alternating which of `Q1` and `Q2` gets updated (both updates are sketched below)
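Sketches of the two targets just described, again for tabular value dictionaries; the function names and the explicit `actions` argument are illustrative only:

```python
import random

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    # Q-learning: the target bootstraps from max over a' of Q(S',a')
    td_target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])

def double_q_act(Q1, Q2, s, actions):
    # A is chosen from Q1 + Q2 (here: their sum)
    return max(actions, key=lambda a: Q1[(s, a)] + Q2[(s, a)])

def double_q_update(Q1, Q2, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    if random.random() < 0.5:   # alternate (here: at random) between Q1 and Q2
        Q1, Q2 = Q2, Q1
    a_star = max(actions, key=lambda a2: Q1[(s_next, a2)])  # argmax under Q1...
    td_target = r + gamma * Q2[(s_next, a_star)]            # ...evaluated by Q2
    Q1[(s, a)] += alpha * (td_target - Q1[(s, a)])
```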
- batch updating: `Q` or `V` is only updated at the end of the batch, by the sum of the total incremental updates (sketch below)
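A small sketch of the batch idea for `V`: every increment is computed against the frozen table, and only their sum is applied at the end. The `(s, r, s_next, done)` transition format is an assumption:

```python
from collections import defaultdict

def batch_td0_update(V, transitions, alpha=0.1, gamma=0.99):
    deltas = defaultdict(float)
    for s, r, s_next, done in transitions:
        td_target = r + gamma * V[s_next] * (not done)
        deltas[s] += alpha * (td_target - V[s])  # computed from the old V
    for s, d in deltas.items():
        V[s] += d  # V only updated at the end, by the summed increments
```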
- TD(lambda): weight the n-step returns by `1-lambda`, `(1-lambda)*lambda`, `(1-lambda)*lambda^2`, etc. (sketch below)
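A quick check of those TD(lambda) weights: they decay geometrically and sum to 1 in the limit, so the lambda-return is a proper weighted average of the n-step returns:

```python
def lambda_weights(lam, n):
    # weight on the k-th n-step return: (1 - lambda) * lambda^(k-1)
    return [(1 - lam) * lam ** k for k in range(n)]

print(lambda_weights(0.9, 3))          # ~[0.1, 0.09, 0.081]
print(sum(lambda_weights(0.9, 1000)))  # ~1.0
```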