Hi Yuanming,

Thanks for releasing the code of this wonderful project!

I have a question about the value network. In `net.py`, `new_value` is predicted by observing `fake_output` and `new_states`. Let `s_t` denote `fake_input`; then `fake_output` is `s_{t+1}`, and `new_states` contains the action `a_t` that transfers `s_t` to `s_{t+1}`. Therefore, it seems the code is predicting `Q(s_t, a_{t-1})` and `Q(s_{t+1}, a_t)` rather than `Q(s_t, a_t)` and `Q(s_{t+1}, a_{t+1})`.
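To make sure I'm describing the pairing I mean, here is a minimal runnable sketch. Only the names `fake_input`, `fake_output`, `new_states`, and `new_value` come from `net.py`; the shapes and the `value_net` stand-in are placeholders I made up:

```python
import numpy as np

# Stand-in for the value network, only to make the input pairing explicit;
# the real network in net.py is of course different.
def value_net(state, action):
    return np.sum(state, axis=-1) + np.sum(action, axis=-1)

s_t   = np.zeros((1, 4))   # fake_input:  state at step t
s_tp1 = np.ones((1, 4))    # fake_output: state at step t+1, reached from s_t via a_t
a_t   = np.ones((1, 2))    # new_states:  the action that transfers s_t to s_{t+1}

# As I read net.py, new_value is computed from (fake_output, new_states),
# i.e. from the pair (s_{t+1}, a_t), which looks like Q(s_{t+1}, a_t) ...
new_value = value_net(s_tp1, a_t)

# ... whereas Eqn. (7) seems to need Q(s_t, a_t) and Q(s_{t+1}, a_{t+1}).
value_for_eqn7 = value_net(s_t, a_t)
print(new_value, value_for_eqn7)
```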
If so, I am confused about how the policy gradients are calculated (e.g., Eqn. (7) in the paper). I might be getting something wrong; I'd appreciate it if you could help me clarify this. Thanks!

Yu Ke