Closed StepNeverStop closed 4 years ago
Do you suggest dividing by `max w`, where `w` are the weights of the whole replay buffer? If that's what you're asking, here's the answer: dividing by `max w` scales the importance sampling ratios to avoid potentially increasing the step size. For that purpose, it is sufficient to divide by the maximum ratio within the sampled batch.

I've been using the global maximum, so I was confused when I saw your code. Actually, I usually use a variable to keep the max and min values when an update op may change them. After hearing your opinion, I will test whether they have any impact on algorithm performance. thx
Also, you can use:

```python
def _compute_IS_ratios(self, probabilities):
    IS_ratios = np.power(np.min(probabilities) / probabilities, self.beta)
    return IS_ratios
```

instead of:

```python
def _compute_IS_ratios(self, N, probabilities):
    IS_ratios = np.power(probabilities * N, -self.beta)
    IS_ratios /= np.max(IS_ratios)  # normalize ratios to avoid scaling the update upward
    return IS_ratios
```
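For what it's worth, the two versions compute identical ratios, because `N` cancels when normalizing by the maximum. A quick numerical sketch (the probability values below are made up for illustration):

```python
import numpy as np

beta = 0.4
# hypothetical normalized sampling probabilities P(i) for a small buffer
probabilities = np.array([0.05, 0.10, 0.15, 0.20, 0.05, 0.25, 0.10, 0.10])
N = len(probabilities)

# variant without N: (min(P) / P)^beta
ratios_min = np.power(np.min(probabilities) / probabilities, beta)

# original variant: (N * P)^(-beta), normalized by its maximum
ratios_n = np.power(probabilities * N, -beta)
ratios_n /= np.max(ratios_n)

# N cancels in the normalization, so both give the same ratios,
# and the largest ratio is exactly 1
assert np.allclose(ratios_min, ratios_n)
assert np.isclose(np.max(ratios_min), 1.0)
```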
Also, it seems that if you use and then update the highest-priority data, you will assign an outdated and inaccurate top value to the new data.
Your first `_compute_IS_ratios` is interesting, but it does not look like an importance sampling ratio. Would you mind explaining the reasoning behind it a bit? I can make some sense of it if I remove the negative sign before `self.beta` -- which amounts to replacing `1/(N**beta * max(prob))` with `min(prob)**beta`. But this looks more like a hack, and I cannot see the need for it.
I see your point about:

> it seems that if you use and then update the highest-priority data, you will assign an outdated and inaccurate top value to the new data.

You're totally right. But could you suggest an efficient way to update it? Calling `np.max` over the whole memory every time we update a priority would be hugely expensive. On the other hand, I don't think this matters much, since the priority of the new data will soon be corrected once it's consumed by the network. Furthermore, for distributed algorithms such as Ape-X, we generally do not use the top priority for new data.
First,

$$w_{j}=\left( N \cdot P(j)\right)^{-\beta}/\max_{i}w_{i}$$

and

$$\max_{i}w_{i}=\max_{i}\left( N \cdot P(i)\right)^{-\beta}=\left( \min_{i} N \cdot P(i)\right)^{-\beta}=\left( N \cdot P_{\min}\right)^{-\beta}$$

so,

$$w_{j}=\left( \frac{P_{\min}}{P(j)} \right)^{\beta}$$
so you don't need the parameter `N`. But it doesn't matter much; I just noticed this while browsing PER-related code.
Second, I have no idea of an efficient way to find `max(priorities)`. In my code, I just use two variables to record the min and max values, but this is still not correct when a sample's priority is updated after it has been used:
```python
def update(self, priority, episode):
    '''
    input: priorities
    '''
    assert hasattr(priority, '__len__'), 'priority must have attribute of len()'
    assert len(priority) == len(self.last_indexs), 'length between priority and last_indexs must equal'
    self.beta += self.beta_interval * episode
    priority = np.power(np.abs(priority) + self.epsilon, self.alpha)
    self.min_p = min(self.min_p, priority.min())
    self.max_p = max(self.max_p, priority.max())
    for i in range(len(priority)):
        self.tree._updatetree(self.last_indexs[i], priority[i])
```
One possible way is to record the first few maximum values and a few minimum values, like `[p_max, p_second_max, ...]`, and update that list on each priority update. Since it is rather unlikely that the priorities of a whole batch of samples are exactly the maximum or minimum, this may be feasible, but I'm not sure.
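A minimal sketch of that idea, assuming the priorities live in a plain numpy array (the class and method names below are hypothetical, not from either code base): track the indices of the k largest priorities, grow the candidate set cheaply when some priority rises past it, and only rescan the whole buffer when a tracked candidate shrinks:

```python
import numpy as np

class MaxTracker:
    """Track the max priority via a small candidate set of top-k indices."""

    def __init__(self, priorities, k=4):
        self.p = np.asarray(priorities, dtype=float).copy()
        self.k = k
        self._rebuild()

    def _rebuild(self):
        # full scan: reset candidates to the indices of the k largest priorities
        self.top = set(np.argsort(self.p)[-self.k:].tolist())

    def update(self, index, priority):
        shrank = index in self.top and priority < self.p[index]
        self.p[index] = priority
        if shrank:
            self._rebuild()  # rare case: a tracked candidate decreased
        elif priority >= min(self.p[i] for i in self.top):
            self.top.add(index)  # cheap case: a new candidate joins the set

    def max(self):
        return max(self.p[i] for i in self.top)
```

`max()` stays exact because the index of the global maximum is always kept in the candidate set; the rescan in `_rebuild` also trims the set back to size k. The full scan only happens when one of the few top candidates is updated downward, which should be rare for large buffers.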
Thanks for sharing your reasoning. That was an interesting observation!
BTW, I think leaving the top priority as it is will be fine. I personally don't think it's a big deal in practice, since we only use it to initialize the priority of new data, which is refreshed once the data is consumed by the network. A potential problem arises when there is too much new data, as in distributed methods: initializing the priorities of new data too high would then skew the sampling process. But, as I mentioned before, in those cases we don't use the top priority for initialization. BTW, in an early version I tested a constant top priority and it also performed well. In fact, I could not even see a difference between the two when I monitored the priorities in TensorBoard. You may want to monitor the priority distribution in your environment and make your decision based on that.
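On the monitoring suggestion: a tiny hypothetical helper (the name and the choice of statistics are made up for illustration) that reduces the priority distribution to a few scalars, which are cheap to log to TensorBoard or print every few hundred updates:

```python
import numpy as np

def priority_summary(priorities):
    # summarize the priority distribution as a handful of scalars
    p = np.asarray(priorities, dtype=float)
    return {
        "min": float(p.min()),
        "median": float(np.median(p)),
        "max": float(p.max()),
        # a large max/median gap suggests a few samples dominate sampling
        "max_over_median": float(p.max() / np.median(p)),
    }
```

If the `max_over_median` curve stays flat over training, a constant (or slightly stale) top priority is unlikely to change the sampling distribution noticeably.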
Yep, I totally agree with your opinion. In fact, an n-step replay buffer seems to be more effective than PER in my experiments. I appreciate the kind communication.
@StepNeverStop, I see that you are interested in RL, as am I. Would you like to leave me a contact so that maybe we can team up for some competition? By the way, I'm also Chinese.
That would be awesome!
Since I cannot find any contact information on your home page, you can reach me first via the e-mail address on mine.
I hope we can make progress together.
It seems you use the max value of samples, but why not the max value of the whole replay buffer?