https://github.com/yukezhu/tensorflow-reinforce/blob/173a04fcf3be1ce37904d3234295ab7855e4fd03/rl/pg_reinforce.py#L192
Hi. Please correct me if I am missing something, but I think that there is a mistake in this line.
Notice that "train_op" calls "apply_gradients", which updates the parameters of the policy network using the log probabilities that the policy outputs when the recorded batch of states is fed. The problem is that the parameters are updated for every time step of the recorded trajectory ("train_op" is called within a temporal for loop), and, due to the fact that the log probabilities are being computed on the fly during the training loop, these probabilities are no longer the ones that were followed during the recorded trajectory. This produces a distribution miss match that violates the on-policy nature of the Reinforce algorithm, and, I believe, may lead to incorrect gradients. I think that the proper way to do the update is to accumulate the gradients of the log probabilities of the entire batch and then perform all the updates.
The same issue applies to the "all_rewards" buffer that is used to normalize the returns: as the policy changes, the mean and standard deviation of the returns change as well, so the normalization being performed is no longer valid (at least in a strict sense).
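Again purely as an illustration (a hypothetical helper, not code from the repo), normalizing with statistics computed only from the current batch of trajectories would avoid mixing in returns generated by older policies:

```python
import numpy as np

# Hypothetical sketch: normalize returns using only the current batch,
# collected under the current policy, instead of a running "all_rewards"
# buffer that also contains returns produced by older policies.
def normalize_returns(returns, eps=1e-8):
    returns = np.asarray(returns, dtype=np.float32)
    return (returns - returns.mean()) / (returns.std() + eps)
```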
As stated before, please correct me if I am wrong. I was simply looking for a REINFORCE implementation in TensorFlow and came across this issue.