[Open] PeiYingjun opened this issue 6 years ago
Sorry, I mean I think it should be:
kl_firstfixed = tf.reduce_sum(tf.stop_gradient(oldaction_dist) * tf.log(tf.stop_gradient(oldaction_dist + eps) / (oldaction_dist + eps))) / Nf
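For what it's worth, here is a sketch (my notation, not the repo's) of why the original definition with action_dist_n in both slots can still make sense: write pi_fixed for the current action distribution with its gradient stopped. The value and first derivative of that KL are degenerate (it is the KL of a distribution with itself), but its second derivative at the current parameters is the Fisher information matrix, which is the only thing TRPO's conjugate-gradient step needs from it.

```latex
% Sketch, my notation: \pi_{\text{fixed}} = \operatorname{stop\_gradient}(\pi_\theta),
% evaluated at the current parameters \theta_{\text{old}}.
D_{\mathrm{KL}}\big(\pi_{\text{fixed}} \,\|\, \pi_\theta\big)
  = \frac{1}{N}\sum_{n}\sum_{a} \pi_{\text{fixed}}(a \mid s_n)\,
    \log\frac{\pi_{\text{fixed}}(a \mid s_n)}{\pi_\theta(a \mid s_n)},
\qquad
\nabla_\theta D_{\mathrm{KL}}\Big|_{\theta_{\text{old}}} = 0,
\qquad
\nabla^2_\theta D_{\mathrm{KL}}\Big|_{\theta_{\text{old}}} = F(\theta_{\text{old}}).
```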
All right, after a quick analysis I think it's reasonable to use the first definition of kl_firstfixed, but I'm still confused about the losses: why do we try to minimize three values?
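If it helps, my understanding (not confirmed against the repo's documentation) is that the three values in losses are not all minimized: only the surrogate loss drives the parameter update, while the KL and the entropy are computed alongside it purely as diagnostics to report each iteration. A rough sketch of that grouping, with illustrative placeholder names (n_actions, action_onehot, advantages are mine, not the repo's):

```python
# Rough sketch of how TRPO implementations often bundle the three "losses".
# Only `surr` is optimized; `kl` and `ent` are evaluated just for monitoring.
import tensorflow as tf

eps = 1e-8
n_actions = 4  # illustrative

action_dist_n  = tf.placeholder(tf.float32, [None, n_actions])  # current policy probs
oldaction_dist = tf.placeholder(tf.float32, [None, n_actions])  # probs at sampling time
action_onehot  = tf.placeholder(tf.float32, [None, n_actions])  # actions actually taken
advantages     = tf.placeholder(tf.float32, [None])

Nf = tf.cast(tf.shape(action_dist_n)[0], tf.float32)  # number of sampled timesteps

logp_n     = tf.log(tf.reduce_sum(action_dist_n  * action_onehot, axis=1) + eps)
old_logp_n = tf.log(tf.reduce_sum(oldaction_dist * action_onehot, axis=1) + eps)

# surrogate loss: importance-weighted advantage; this is the only term that is minimized
surr = -tf.reduce_mean(tf.exp(logp_n - old_logp_n) * advantages)

# diagnostics: KL(old || new) and the entropy of the current policy
kl  = tf.reduce_sum(oldaction_dist * tf.log((oldaction_dist + eps) / (action_dist_n + eps))) / Nf
ent = tf.reduce_sum(-action_dist_n * tf.log(action_dist_n + eps)) / Nf

losses = [surr, kl, ent]  # bundled so one session.run can report all three per iteration
```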
Thanks for the implementation of TRPO. There are some details that don't make sense to me so far; in particular, I can't see why kl_firstfixed is defined as follows:
kl_firstfixed = tf.reduce_sum(tf.stop_gradient(action_dist_n) * tf.log(tf.stop_gradient(action_dist_n + eps) / (action_dist_n + eps))) / Nf
It seems we don't make use of oldaction_dist at all. Shouldn't it instead be the following?
kl_firstfixed = tf.reduce_sum(tf.stop_gradient(oldaction_dist) * tf.log(tf.stop_gradient(oldaction_dist + eps) / (action_dist_n + eps))) / Nf
Besides, why does losses contain the entropy of action_dist_n? Why must it be minimized?
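Regarding the kl_firstfixed question: as far as I can tell, that quantity is never minimized either; it exists only so that the graph can be differentiated twice to produce Fisher-vector products for the conjugate-gradient step. Since that second derivative is taken exactly at the parameters that generated the rollout, action_dist_n and oldaction_dist coincide there, so using action_dist_n in both slots (with stop_gradient on the first factor) yields the same Hessian. A rough sketch of that downstream usage, with my own helper names (fisher_vector_product, policy_vars, flat_tangent are illustrative, not the repo's):

```python
# Sketch: kl_firstfixed is only differentiated twice to form Fisher-vector products.
# Its gradient evaluates to ~0 at the current parameters, but its Hessian there is
# the Fisher information matrix of the policy.
import numpy as np
import tensorflow as tf

def fisher_vector_product(kl_firstfixed, policy_vars, flat_tangent):
    """Returns the flattened product (Fisher matrix) @ (tangent vector)."""
    # first derivative of the "first argument fixed" KL w.r.t. the policy parameters
    grads = tf.gradients(kl_firstfixed, policy_vars)
    # slice the flat tangent vector back into per-variable shapes
    tangents, start = [], 0
    for v in policy_vars:
        shape = v.get_shape().as_list()
        size = int(np.prod(shape))
        tangents.append(tf.reshape(flat_tangent[start:start + size], shape))
        start += size
    # gradient-vector product, then differentiate again -> Hessian-vector product
    gvp = tf.add_n([tf.reduce_sum(g * t) for g, t in zip(grads, tangents)])
    fvp = tf.gradients(gvp, policy_vars)
    return tf.concat([tf.reshape(f, [-1]) for f in fvp], axis=0)
```

The resulting op would then be run repeatedly inside the conjugate-gradient loop, feeding a new flat_tangent each iteration, which is why no separate oldaction_dist tensor is needed at that point.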