xlnwel / model-free-algorithms

TD3, SAC, IQN, Rainbow, PPO, Ape-X, etc. in TF1.x

PER #1

Closed StepNeverStop closed 4 years ago

StepNeverStop commented 4 years ago

Hi, you can use

(image: a formula for the importance sampling weights; the equivalent code is posted later in this thread)

instead of:

def _compute_IS_ratios(self, N, probabilities):
    IS_ratios = np.power(probabilities * N, -self.beta)
    IS_ratios /= np.max(IS_ratios)  # normalize ratios to avoid scaling the update upward
    return IS_ratios

It seems you use the max value over the sampled batch, but why not the max value over the whole replay buffer?

xlnwel commented 4 years ago

Do you mean dividing by max w, where w ranges over the weights of the whole replay buffer? If that's what you're asking, here's the answer:

  1. I cannot see the reason to do so, since the purpose of dividing by max w is to scale the importance sampling ratios down to avoid potentially increasing the step size. Therefore, it is sufficient to divide by the maximum ratio of the sampled batch (see the sketch below).
  2. Computing the max value over the replay buffer is computationally expensive, since we constantly add new data to the buffer and update priorities.
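
Here is a minimal sketch with made-up numbers to illustrate point 1: both normalizations keep every weight at or below 1, and dividing by the buffer-wide maximum only shrinks the batch weights by an extra constant factor.

import numpy as np

beta = 0.4
N = 1000                                     # buffer size (made up)
batch_probs = np.array([0.002, 0.01, 0.05])  # P(i) of a sampled batch (made up)
buffer_p_min = 0.0005                        # smallest P(i) in the whole buffer

w = np.power(batch_probs * N, -beta)

w_batch = w / np.max(w)                           # normalize by the batch max
w_buffer = w / np.power(buffer_p_min * N, -beta)  # normalize by the buffer-wide max

print(w_batch)   # largest entry is exactly 1
print(w_buffer)  # same relative scale, every entry strictly below 1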
StepNeverStop commented 4 years ago

Since I've been using the global maximum, I was confused when I saw your code. Actually, I usually use variables to keep track of the max and min values, updating them whenever an update op may change them. After hearing your opinion, I will test whether the two choices have any impact on algorithm performance. Thx

Also, you can use:

def _compute_IS_ratios(self, probabilities):
    IS_ratios = np.power(np.min(probabilities) / probabilities, self.beta)
    return IS_ratios

instead of:

def _compute_IS_ratios(self, N, probabilities):
    IS_ratios = np.power(probabilities * N, -self.beta)
    IS_ratios /= np.max(IS_ratios)  # normalize ratios to avoid scaling the update upward
    return IS_ratios

Also, it seems that if you sample and update the highest-priority data, you will assign an outdated and inaccurate top value to the new data.

xlnwel commented 4 years ago

Your first _compute_IS_ratios is interesting, but it does not look like an importance sampling ratio. Would you mind explaining the reasoning behind it a bit? I can make some sense of it if the negative sign before self.beta is removed -- which amounts to replacing 1/(N**beta * max(prob)) with min(prob)**beta. But that looks more like a hack, and I cannot see the need for it.

I see your point about

it seems that if you sample and update the highest-priority data, you will assign an outdated and inaccurate top value to the new data.

You're totally right. But could you suggest an efficient way to update it? Calling np.max over the whole memory every time we update the priorities would be hugely costly. On the other hand, I don't think this matters much, since the priority of new data is soon corrected once it's consumed by the network (see the toy sketch below). Furthermore, for distributed algorithms such as Ape-X, we generally do not use the top priority for new data.
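
Here's a toy sketch (made-up numbers, not this repo's actual API) of the self-correction I mean: a new transition enters with the recorded top value, and the first time it is sampled its priority is overwritten by the real TD error.

import numpy as np

priorities = np.array([0.9, 0.3, 0.5])  # existing priorities in the buffer
top = priorities.max()                   # recorded top value, possibly stale

priorities = np.append(priorities, top)  # new transition enters with the top value
idx = 3                                  # suppose the learner samples it next
td_error = 0.12                          # TD error computed by the network (made up)
priorities[idx] = abs(td_error) + 1e-6   # the stale guess is replaced immediately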

StepNeverStop commented 4 years ago

First,

$$w_j = \left( N \cdot P(j) \right)^{-\beta} / \max_i w_i$$

and

$$\max_i w_i = \max_i \left( N \cdot P(i) \right)^{-\beta} = \left( N \cdot \min_i P(i) \right)^{-\beta} = \left( N \cdot P_{\min} \right)^{-\beta}$$

so

$$w_j = \left( \frac{P_{\min}}{P(j)} \right)^{\beta}$$

and you don't need the parameter N. But it doesn't matter; I just noticed this while browsing PER-related code.
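
A quick numerical check of the equivalence, with made-up probabilities:

import numpy as np

N, beta = 1000, 0.4
probs = np.array([0.001, 0.004, 0.02])

w1 = np.power(probs * N, -beta)
w1 /= np.max(w1)                            # the original form
w2 = np.power(np.min(probs) / probs, beta)  # the simplified form; N cancels out

assert np.allclose(w1, w2)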

Second, I have no idea of an efficient way to find max(priorities). In my code, I just use two variables to record the min and max values, but this is still not correct when an item's priority is updated after it is used:

def update(self, priority, episode):
    '''
    Update the priorities of the last sampled transitions.
    '''
    assert hasattr(priority, '__len__'), 'priority must have attribute of len()'
    assert len(priority) == len(self.last_indexs), 'length between priority and last_indexs must equal'
    self.beta += self.beta_interval * episode  # anneal beta
    priority = np.power(np.abs(priority) + self.epsilon, self.alpha)
    # track the running min/max -- these only ever grow more extreme,
    # which is exactly the staleness problem discussed above
    self.min_p = min(self.min_p, priority.min())
    self.max_p = max(self.max_p, priority.max())
    for i in range(len(priority)):
        self.tree._updatetree(self.last_indexs[i], priority[i])

One possible way: record the first few maximum values and a few minimum values, like [p_max, p_second_max, ...], and update this list on each priority update. Since it is fairly unlikely that a single batch of samples contains exactly the maximum or minimum, the list would rarely be invalidated all at once. It may be feasible; I don't know.
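
A rough sketch of that idea (hypothetical and untested; all names are made up): track the indices holding the few largest priorities, so each priority update only touches this small set.

class TopFewTracker:
    '''Keep the indices of the ~k largest priorities as a cheap
    stand-in for the true buffer-wide maximum.'''
    def __init__(self, k=4):
        self.k = k
        self.top = {}  # index -> priority

    def on_update(self, index, priority):
        if index in self.top or len(self.top) < self.k:
            self.top[index] = priority
        elif priority > min(self.top.values()):
            # evict the smallest tracked entry to make room
            del self.top[min(self.top, key=self.top.get)]
            self.top[index] = priority

    def max_priority(self):
        return max(self.top.values())

The estimate can still go stale if all k tracked entries are lowered at once, which is why I'm not sure it's fully correct.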

xlnwel commented 4 years ago

Thanks for sharing your reasoning. That was an interesting observation!

BTW, I think leaving the top priority as it is will be fine. I personally don't think it's a big deal in practice, since we only use it to initialize the priority of new data, and that priority is refreshed once the data is consumed by the network. A potential problem arises when there is too much new data, as in distributed methods; in those cases, initializing the priorities of new data too high would distort the sampling process. But, as I mentioned before, in those cases we don't use the top priority as the initialization anyway. Also, in an early version I tested a constant top priority and it performed just as well; in fact, I could not even see a difference between the two when I monitored the priorities in TensorBoard. You may want to monitor the priority distribution in your environment and make your decision accordingly.
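
For instance, here is a minimal TF1.x sketch of what I mean by monitoring (the placeholder and log-dir names are made up):

import tensorflow as tf  # TF1.x

priority_ph = tf.placeholder(tf.float32, [None], name='priority')
summary_op = tf.summary.histogram('priority', priority_ph)
writer = tf.summary.FileWriter('logs/per')

# in the training loop, after sampling a batch:
#   summary = sess.run(summary_op, feed_dict={priority_ph: batch_priorities})
#   writer.add_summary(summary, global_step)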

StepNeverStop commented 4 years ago

Yep, I totally agree with your opinion. In fact, an n-step replay buffer seems to be more effective than PER in my experiments. I appreciate your kind communication.

xlnwel commented 4 years ago

@StepNeverStop, I see that you are interested in RL, as I am. Would you like to leave me a contact so that maybe we can cooperate in some competitions? By the way, I'm also Chinese.

StepNeverStop commented 4 years ago

That would be awesome!

Since I cannot find any contact information on your home page, you can find my e-mail address on mine first.

I hope we can make progress together.