xbpeng / awr

Implementation of advantage-weighted regression.
MIT License

Beta (temperature) set to 1.0 #3

Closed by TrentBrick 4 years ago

TrentBrick commented 4 years ago

The paper says the beta value is 0.05, but in the code here it is 1.0 for all environments. Can you provide an explanation for this, please?

Thank you. Trenton

xbpeng commented 4 years ago

We are using advantage normalization in this implementation, so that takes care of the scaling. You can also remove advantage normalization and use a fixed beta, and that should work as well.
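As a rough sketch of the two options (not the repo's exact code; the zero-mean/unit-std normalization and the epsilon are assumptions), the weighting could look like:

```python
import numpy as np

def awr_weights_normalized(advantages, temp=1.0, clip=20.0):
    # Normalize advantages to zero mean / unit std before exponentiating,
    # so a single temperature (1.0) works across environments with very
    # different reward scales.
    adv = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    return np.minimum(np.exp(adv / temp), clip)

def awr_weights_fixed_beta(advantages, beta=0.05, clip=20.0):
    # Fixed-temperature variant as described in the paper: exp(A / beta), clipped.
    return np.minimum(np.exp(advantages / beta), clip)
```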

TrentBrick commented 4 years ago

@xbpeng thank you for the quick reply.

Is there any performance difference between advantage normalization and fixed beta?

Also, if the answer to the above is yes: since you are an author on the Reward-Conditioned Policies paper, do you know whether RCP-A uses advantage normalization?

Thanks again, Trenton

TrentBrick commented 4 years ago

I have tried to implement the unnormalized version with beta = 0.05 and weight clipping at 20. However, this does not seem to work for LunarLander-v2, because negative rewards like -200 give exp(-200/0.05) = 0, producing all-zero weights.
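For reference, the underflow can be checked directly (a minimal sketch using the -200 value above):

```python
import numpy as np

beta = 0.05
raw = -200.0                 # a typical early LunarLander-v2 return
print(np.exp(raw / beta))    # exp(-4000) underflows to 0.0 in float64
```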

Any help explaining this is much appreciated.

Thank you, Trenton

xbpeng commented 4 years ago

Are you using advantages or returns for the weights? The returns may have a large magnitude, but the advantages will usually be smaller. Some of the gym envs can have very different scales for the reward function, so advantage normalization can help a lot in mitigating some of that env-specific scaling.
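A small illustration of that scale difference (the numbers below are made up for the example, not measured from the repo):

```python
import numpy as np

# Hypothetical LunarLander-style returns and a value-function baseline.
returns = np.array([-220.0, -180.0, -200.0, -150.0])
values  = np.array([-210.0, -190.0, -195.0, -160.0])
advantages = returns - values          # much smaller magnitude than the returns

print(np.exp(returns / 0.05))          # all underflow to 0.0
print(np.exp(advantages / 0.05))       # finite, though still extreme without clipping

# Normalizing the advantages removes the remaining env-specific scale,
# which is why a single temperature of 1.0 can be used across environments.
norm_adv = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
print(np.exp(norm_adv / 1.0))
```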

TrentBrick commented 4 years ago

I am using advantages for the weights, as stated in the paper (right?). And the scaling explanation makes sense with regard to the different environments and having a single beta value that works across all of them.

Thank you, Trenton