Okay, I did some looking around and found that if you sampled rewards at each time step in the episode (as discussed in Silver's lecture), we would require a lot of training iterations. But I think the update you are using is theta <- theta + alpha * (r - b) * grad log pi(a|s), right?
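For concreteness, here is a minimal sketch of that update with a baseline, assuming a tabular softmax policy over discrete actions; all names (`reinforce_update`, the argument layout, `alpha=0.01`) are illustrative, not taken from this repo:

```python
import numpy as np

def reinforce_update(theta, states, actions, returns, baselines, alpha=0.01):
    """One pass of theta += alpha * (G_t - b_t) * grad log pi(a_t|s_t).

    theta:     (n_states, n_actions) logits of a tabular softmax policy
    states:    state indices visited in the episode
    actions:   action indices taken
    returns:   returns G_t (e.g. discounted sums of rewards)
    baselines: baseline values b_t (e.g. a state-value estimate)
    """
    for s, a, G, b in zip(states, actions, returns, baselines):
        probs = np.exp(theta[s]) / np.exp(theta[s]).sum()  # pi(.|s)
        grad_log = -probs              # grad of log softmax w.r.t. logits...
        grad_log[a] += 1.0             # ...is one_hot(a) - probs
        theta[s] += alpha * (G - b) * grad_log
    return theta
```

The baseline b only shifts the estimate; it leaves the gradient unbiased while reducing variance.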
While using policy gradients for Reinforcement Learning, you are using the discounted reward. But in the David Silver lecture he says the rewards are sampled from a distribution, so why do you use the discounted reward?
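To make the discounted-reward question concrete, here is a minimal sketch of computing discounted returns G_t = sum_k gamma^k * r_{t+k} from a list of sampled per-step rewards; `gamma=0.99` and the example rewards are assumptions for illustration:

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute G_t for each step by a single backward pass over the episode."""
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return out[::-1]

# e.g. discounted_returns([1.0, 0.0, 2.0])
# -> [1.0 + 0.99**2 * 2.0, 0.99 * 2.0, 2.0] = [2.9602, 1.98, 2.0]
```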