guotong1988 closed this issue 7 years ago
@guotong1988 I do not quite understand your question, but every transition is recorded as a 4-element tuple in the deque. Note that the reward is an instant reflection of the environment. I really think you should learn the basic concepts of Markov Reward Processes and Markov Decision Processes before diving into reinforcement learning.
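To make the discussion concrete, here is a minimal sketch of the kind of replay memory being described: each transition is appended to a bounded deque as a 4-element tuple and batches are drawn uniformly at random. The class and method names (`ReplayBuffer`, `store`, `sample`) are hypothetical illustrations, not taken from this repo.

```python
import random
from collections import deque

class ReplayBuffer:
    """Illustrative uniform experience-replay buffer (not the repo's code)."""

    def __init__(self, capacity=10000):
        # deque with maxlen silently drops the oldest transition
        # once capacity is reached
        self.memory = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state):
        # Every transition, whatever its reward, is recorded
        # as a 4-element tuple
        self.memory.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform sampling: rare reward=+1/-1 transitions have
        # exactly the same chance of being drawn as reward=0 ones
        return random.sample(self.memory, batch_size)
```

This uniform sampling is precisely why a rare nonzero reward only influences training on the occasions it happens to be drawn.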
Look here: transitions with reward 1 or -1 can only take effect after they are sampled and fed in, so they only matter when they happen to be sampled.
@guotong1988 I see, you are right. That is why, in another issue, the author suggested that prioritized experience replay can potentially mitigate this problem by increasing the probability that more significant transitions are chosen.
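As a rough illustration of that idea, here is a sketch of proportional prioritized sampling. It is not the author's implementation; as a stand-in priority I use `|reward| + eps` (real prioritized replay typically uses the TD error, per Schaul et al.), and the function name `prioritized_sample` is my own.

```python
import random

def prioritized_sample(transitions, batch_size, eps=0.01):
    """Draw a batch with probability proportional to a per-transition
    priority. Here priority = |reward| + eps, purely for illustration;
    a small eps keeps zero-reward transitions sampleable."""
    priorities = [abs(t[2]) + eps for t in transitions]
    # random.choices samples with replacement, weighted by priority,
    # so rare reward=+1/-1 transitions are drawn far more often
    # than their raw frequency in the deque would suggest
    return random.choices(transitions, weights=priorities, k=batch_size)
```

With this weighting, a single reward=1 transition among 99 reward=0 ones is picked in roughly half the draws instead of 1% of them, which is the "faster feedback" effect being discussed.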
Oh, it's here. Thank you!
@guotong1988 You are welcome :)
I see from here that all the rewards are added to the deque. We have to wait for the transitions with reward 1 and -1 to be sampled from the deque before they can be used. So don't you think it may be slow?
In Chinese (translated): Aren't the transitions with reward 1 and -1 also put in the deque? If so, the probability of those transitions being sampled would be very low, so the feedback would be very slow, right?
Thank you @yenchenlin