zplizzi / pytorch-ppo

Simple, readable, yet full-featured implementation of PPO in Pytorch

reward shaping in Atari #3

Open merv801 opened 4 years ago

merv801 commented 4 years ago

Hello. I have run your algorithm on the Pong game twice, for about 3k steps each: once with clip_rewards=True and once with clip_rewards=False. With clip_rewards=False it did not progress much, but with clip_rewards=True the results look like yours. I thought that in the Pong environment clip_rewards should have no effect, because the rewards are already -1, 0, or +1. Do you have any idea what the cause is? Thanks

zplizzi commented 4 years ago

Hm, that is strange. The only place that clip_rewards is applied is here:

if self.args.clip_rewards:
    # clip reward to one of {-1, 0, 1}
    step_reward = np.sign(step_reward)

As you say, in Pong the rewards are already in this set, so it should have no effect. My guess is that some random variation between runs caused the behavior you're seeing. I would try running them again.
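
As a quick sanity check (np.sign maps negatives to -1, zero to 0, positives to +1), clipping really is a no-op on Pong's raw reward values:

>>> import numpy as np
>>> np.sign([-1.0, 0.0, 1.0])
array([-1.,  0.,  1.])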

zplizzi commented 4 years ago

Actually, that isn't strictly true: step_reward is the sum of rewards over steps_to_skip timesteps. But I don't think it's possible to get multiple rewards within a few frames of each other in Pong, so this still shouldn't matter.
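
To illustrate the point, here's a minimal sketch of how a frame-skip loop typically accumulates rewards before clipping. The function name and the gym-style 4-tuple step API are illustrative, not the repo's exact code:

import numpy as np

def skip_step(env, action, steps_to_skip, clip_rewards=True):
    """Repeat `action` for `steps_to_skip` frames, summing rewards."""
    step_reward = 0.0
    done = False
    info = {}
    for _ in range(steps_to_skip):
        obs, reward, done, info = env.step(action)
        step_reward += reward  # sum over the skipped frames
        if done:
            break
    if clip_rewards:
        # clipping is applied to the *sum*, so it only matters if
        # multiple nonzero rewards land within one skip window
        step_reward = np.sign(step_reward)
    return obs, step_reward, done, info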

merv801 commented 4 years ago

Thanks for your response. I ran it again with clip_rewards=False and this time it is working well, so the first result was indeed random variation. Still, it seems quite strange; I didn't expect to see such a difference between two runs, since I have heard that PPO is relatively stable. (In the first run the agent got stuck in the -8 to 2 reward range.)

zplizzi commented 4 years ago

Yeah, it's possible that the hyperparameters I tested with aren't great (I didn't tune them at all), or maybe it would work more reliably with frame stacking (#2). But RL does generally have a good deal of run-to-run variation, even in the more stable algorithms, so who knows.
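
For completeness, a minimal sketch of the frame-stacking idea referenced in #2, assuming the classic gym step API; the class and parameter names are illustrative, not code from this repo:

from collections import deque
import numpy as np

class FrameStack:
    """Stack the last `k` observations along a new leading axis."""
    def __init__(self, env, k=4):
        self.env = env
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self):
        obs = self.env.reset()
        # fill the buffer with the first frame so the stack is always full
        for _ in range(self.k):
            self.frames.append(obs)
        return np.stack(self.frames)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.frames.append(obs)
        return np.stack(self.frames), reward, done, info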