yukezhu / tensorflow-reinforce

Implementations of Reinforcement Learning Models in Tensorflow
MIT License

PG Reinforce #8

Closed kris-singh closed 4 years ago

kris-singh commented 7 years ago

When using policy gradients for reinforcement learning, you are using the discounted reward. But I think in the David Silver lecture he says the rewards are sampled from a distribution. Why do you use the discounted reward instead?
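(For reference, a minimal sketch of how a discounted return is typically computed from the per-step rewards sampled in one episode; this is illustrative and not the repository's actual code, and `discounted_returns` is a hypothetical helper name.)

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = sum_k gamma^k * r_{t+k} for each step t of one episode."""
    returns = np.zeros(len(rewards))
    running = 0.0
    # Walk backwards so each G_t reuses G_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Example: rewards sampled during one episode.
print(discounted_returns([1.0, 0.0, 1.0], gamma=0.9))  # [1.81, 0.9, 1.0]
```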

kris-singh commented 7 years ago

Okay, I did some looking around and found that if you sampled rewards at each time step of the episode (as discussed in Silver's lecture), we would require a lot more training iterations. But I think the formula you are using is the gradient update alpha * (G_t - b) * grad log pi(a_t | s_t), i.e. REINFORCE with a baseline. Right?
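(To make that formula concrete, here is a rough sketch of one REINFORCE-with-baseline update for a simple linear softmax policy. It is a hypothetical, self-contained example with made-up names like `reinforce_update` and a constant mean-return baseline; it is not the code in this repository.)

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_update(theta, episode, gamma=0.99, alpha=0.01):
    """One REINFORCE-with-baseline update for a linear softmax policy.

    theta:   (num_actions, state_dim) weight matrix (hypothetical policy params).
    episode: list of (state, action, reward) tuples from one sampled rollout.
    """
    states, actions, rewards = zip(*episode)
    # Discounted returns G_t, computed backwards over the episode.
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    baseline = returns.mean()  # simple constant baseline b
    for s, a, G in zip(states, actions, returns):
        probs = softmax(theta @ s)
        # Gradient of log pi(a|s) w.r.t. theta for a linear softmax policy:
        # d/d theta_k = ([k == a] - pi(k|s)) * s
        grad_log_pi = -np.outer(probs, s)
        grad_log_pi[a] += s
        # theta <- theta + alpha * (G_t - b) * grad log pi(a_t|s_t)
        theta += alpha * (G - baseline) * grad_log_pi
    return theta
```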