xysun opened 5 years ago
My summary notes on model free prediction and control.
Code:
Model free prediction and control

- "offline": `V(S)` or `Q(S,A)` is only updated after each episode (so all episodes must terminate); "online": update per time step

Monte Carlo methods

- update `Q(S,A)` for every `(S,A)` pair visited; set `pi(S)` to be `argmax(Q(S,A))`
- some `(S,A)` pairs may never be visited, so make sure every `(S,A)` has a nonzero probability of being selected at the start ("exploring starts"); see the sketch below
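As a rough illustration of the Monte Carlo bullets, here is a minimal every-visit MC control sketch in Python. The environment interface (`env.states`, `env.actions`, and `env.step(s, a)` returning `(s_next, r, done)`) is hypothetical, not a real library API:

```python
import random
from collections import defaultdict

def mc_control_es(env, n_episodes, gamma=1.0):
    Q = defaultdict(float)     # Q(S,A) estimates
    counts = defaultdict(int)  # visit counts, for incremental averaging
    pi = {}                    # greedy policy: pi(S) = argmax over A of Q(S,A)

    for _ in range(n_episodes):
        # exploring starts: every (S,A) has a nonzero probability
        # of being selected at the start of an episode
        s, a = random.choice(env.states), random.choice(env.actions)
        episode, done = [], False
        while not done:                        # all episodes must terminate
            s_next, r, done = env.step(s, a)   # hypothetical env API
            episode.append((s, a, r))
            s = s_next
            a = pi.get(s, random.choice(env.actions))

        # walk the episode backwards so G accumulates the discounted return,
        # updating Q(S,A) for every (S,A) pair visited
        G = 0.0
        for s, a, r in reversed(episode):
            G = r + gamma * G
            counts[(s, a)] += 1
            Q[(s, a)] += (G - Q[(s, a)]) / counts[(s, a)]  # alpha = 1/k average
            pi[s] = max(env.actions, key=lambda a2: Q[(s, a2)])
    return Q, pi
```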
Temporal Difference methods

- update rule: `Q(S,A) <- Q(S,A) + alpha * (R + gamma*Q(S',A') - Q(S,A))`, where `S'` is the next state and `A'` is chosen following the current policy (this is Sarsa; sketch below)
- `alpha` comes from the incremental implementation of averaging, where `alpha = 1/k`; effectively a learning rate
- "TD target" = `R + gamma*Q(S',A')`; "TD error" = "TD target" - `Q(S,A)`
- Q-learning: replace `Q(S',A')` with `max(Q(S',a))` over actions `a`
- off-policy: replace `Q(S',A')` with `Q(S',A_pi)`, where `A_pi` is chosen following the target policy (`A` is generated by the behaviour policy)
- double Q-learning: `A` is chosen from `Q1 + Q2` (average or sum); after observing `S,A,R',S'`, update `Q1(S,A) <- Q1(S,A) + alpha * (R' + gamma*Q2(S', argmax(Q1(S',.))) - Q1(S,A))`, alternating which of `Q1` and `Q2` gets updated (both updates are sketched below)
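Sketches of the two targets just described, again for tabular value dictionaries; the function names and the explicit `actions` argument are illustrative only:

```python
import random

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    # Q-learning: the target bootstraps from max over a' of Q(S',a')
    td_target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])

def double_q_act(Q1, Q2, s, actions):
    # A is chosen from Q1 + Q2 (here: their sum)
    return max(actions, key=lambda a: Q1[(s, a)] + Q2[(s, a)])

def double_q_update(Q1, Q2, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    if random.random() < 0.5:   # alternate (here: at random) between Q1 and Q2
        Q1, Q2 = Q2, Q1
    a_star = max(actions, key=lambda a2: Q1[(s_next, a2)])  # argmax under Q1...
    td_target = r + gamma * Q2[(s_next, a_star)]            # ...evaluated by Q2
    Q1[(s, a)] += alpha * (td_target - Q1[(s, a)])
```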
- batch updating: `Q` or `V` is only updated at the end of the batch, by the sum of the total incremental updates (sketch below)
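A small sketch of the batch idea for `V`: every increment is computed against the frozen table, and only their sum is applied at the end. The `(s, r, s_next, done)` transition format is an assumption:

```python
from collections import defaultdict

def batch_td0_update(V, transitions, alpha=0.1, gamma=0.99):
    deltas = defaultdict(float)
    for s, r, s_next, done in transitions:
        td_target = r + gamma * V[s_next] * (not done)
        deltas[s] += alpha * (td_target - V[s])  # computed from the old V
    for s, d in deltas.items():
        V[s] += d  # V only updated at the end, by the summed increments
```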
- TD(lambda): weight the n-step returns by `1-lambda`, `(1-lambda)*lambda`, `(1-lambda)*lambda^2`, etc. (sketch below)
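A quick check of those TD(lambda) weights: they decay geometrically and sum to 1 in the limit, so the lambda-return is a proper weighted average of the n-step returns:

```python
def lambda_weights(lam, n):
    # weight on the k-th n-step return: (1 - lambda) * lambda^(k-1)
    return [(1 - lam) * lam ** k for k in range(n)]

print(lambda_weights(0.9, 3))          # ~[0.1, 0.09, 0.081]
print(sum(lambda_weights(0.9, 1000)))  # ~1.0
```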