Open merv801 opened 4 years ago
Hello. I have run your algorithm on Pong twice, for about 3k steps each: once with `clip_rewards=True` and once with `clip_rewards=False`. With `clip_rewards=False` it did not progress much, but with `clip_rewards=True` the results are like yours. I thought that in the Pong environment setting `clip_rewards` should not have any effect, because the rewards are already 0, +1, or -1. Do you have any idea what the cause is? Thanks
Hm, that is strange. The only place that `clip_rewards` is applied is here:

    if self.args.clip_rewards:
        # clip reward to one of {-1, 0, 1}
        step_reward = np.sign(step_reward)
As you say, in Pong the rewards are already in this set, so it should have no effect. My guess is that there is just some random variation between runs causing the behavior you see; I would try running them again.
Actually, that isn't strictly true: `step_reward` is the sum of rewards over `steps_to_skip` timesteps. But I don't think it's possible to get multiple rewards within a few steps in Pong, so this still shouldn't matter.
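To make that concrete, the interaction between frame skipping and clipping typically looks something like the sketch below. This is not the repo's exact code; the `skip_step` helper and the old-style Gym `env.step` interface are illustrative, and only `clip_rewards`, `steps_to_skip`, and the `np.sign` line come from the discussion above.

```python
import numpy as np

def skip_step(env, action, steps_to_skip, clip_rewards):
    """Repeat `action` for `steps_to_skip` frames and sum the per-frame rewards.

    Because the *sum* is what gets clipped, clipping only changes anything when
    several non-zero rewards fall inside one skip window. In Pong, scoring
    events are many frames apart, so the sum stays in {-1, 0, +1} and
    np.sign is a no-op.
    """
    step_reward = 0.0
    obs, done, info = None, False, {}
    for _ in range(steps_to_skip):
        obs, reward, done, info = env.step(action)
        step_reward += reward
        if done:
            break
    if clip_rewards:
        # clip reward to one of {-1, 0, 1}
        step_reward = np.sign(step_reward)
    return obs, step_reward, done, info
```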
Thanks for your response. I ran it again with `clip_rewards=False` and this time it is working well, so the first run was indeed random variation. Still, it seems quite strange; I didn't expect to see such a difference between two runs, since I have heard that PPO is relatively stable. (The first time, the agent got stuck in the -8 to 2 reward range.)
Yeah, it's possible that the hyperparameters I tested with aren't great (I didn't tune them at all), or maybe it would work more reliably with frame stacking (#2). But RL generally has a good deal of run-to-run variation even in the more stable algorithms, so who knows.
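For reference, frame stacking along the lines of #2 is usually done with a wrapper that keeps a rolling buffer of the last `k` observations. Here is a minimal sketch assuming an old-style Gym `reset`/`step` interface; the `FrameStack` class and its details are illustrative, not taken from this repo.

```python
from collections import deque

import numpy as np


class FrameStack:
    """Wrap an env so observations are the last `k` frames stacked on a new axis."""

    def __init__(self, env, k=4):
        self.env = env
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self):
        obs = self.env.reset()
        # Fill the buffer with copies of the first frame.
        for _ in range(self.k):
            self.frames.append(obs)
        return np.stack(self.frames, axis=0)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.frames.append(obs)
        return np.stack(self.frames, axis=0), reward, done, info
```

The stacked frames give the policy short-term motion information (e.g. the ball's direction in Pong), which is often what makes training more reliable.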