Hi Yuanming,

Thanks for releasing the code of this wonderful project!

I have a question about the value network. In `net.py`, `new_value` is predicted by observing `fake_output` and `new_states`. Let `s_t` denote `fake_input`; then `fake_output` is `s_{t+1}`, and `new_states` contains the action `a_t` that transfers `s_t` to `s_{t+1}`. Therefore, it seems the code is predicting `Q(s_t, a_{t-1})` and `Q(s_{t+1}, a_t)` rather than `Q(s_t, a_t)` and `Q(s_{t+1}, a_{t+1})`.
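To make sure I'm describing the pairing I mean, here is a minimal runnable sketch. Only the names `fake_input`, `fake_output`, `new_states`, and `new_value` come from `net.py`; the shapes and the `value_net` stand-in are placeholders I made up:

```python
import numpy as np

# Stand-in for the value network, only to make the input pairing explicit;
# the real network in net.py is of course different.
def value_net(state, action):
    return np.sum(state, axis=-1) + np.sum(action, axis=-1)

s_t   = np.zeros((1, 4))   # fake_input:  state at step t
s_tp1 = np.ones((1, 4))    # fake_output: state at step t+1, reached from s_t via a_t
a_t   = np.ones((1, 2))    # new_states:  the action that transfers s_t to s_{t+1}

# As I read net.py, new_value is computed from (fake_output, new_states),
# i.e. from the pair (s_{t+1}, a_t), which looks like Q(s_{t+1}, a_t) ...
new_value = value_net(s_tp1, a_t)

# ... whereas Eqn. (7) seems to need Q(s_t, a_t) and Q(s_{t+1}, a_{t+1}).
value_for_eqn7 = value_net(s_t, a_t)
print(new_value, value_for_eqn7)
```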
If so, I am confused about how the policy gradients are calculated (e.g., Eqn. (7) in the paper). I might be getting something wrong; I'd appreciate it if you could help me clarify this. Thanks!

Yu Ke