xbpeng / awr

Implementation of advantage-weighted regression.
MIT License

Train_Return vs Test_Return #4

Open masonjar-source opened 3 years ago

masonjar-source commented 3 years ago

Hi, Thank you for sharing the repo!

I was wondering how Train_Return and Test_Return are calculated and what the difference between the two is. I see that one uses norm_a_tf and the other uses sample_a_tf in the code.

xbpeng commented 3 years ago

Train return corresponds to performance during training rollouts, where a stochastic policy is used. Test return corresponds to rollouts that use a deterministic policy.
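For a Gaussian policy, the distinction usually comes down to taking the distribution's mean versus drawing a sample from it. The sketch below is a minimal, hypothetical illustration (the linear `W`, fixed `STD`, and function names are assumptions for this example, not the repo's actual code; `norm_a_tf` and `sample_a_tf` in the repo play analogous roles):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear-Gaussian policy: mean = obs @ W, fixed exploration std.
W = rng.standard_normal((3, 2)) * 0.1
STD = 0.2

def deterministic_action(obs):
    # "Test" rollout: act with the mean of the Gaussian (no noise).
    return obs @ W

def stochastic_action(obs, rng):
    # "Train" rollout: sample from the Gaussian (mean + noise).
    return deterministic_action(obs) + STD * rng.standard_normal(2)

obs = rng.standard_normal(3)
det_a = deterministic_action(obs)   # identical on every call
sto_a = stochastic_action(obs, rng) # varies call to call
```

With this setup, test-time returns are reproducible for a fixed environment seed, while train-time returns carry extra variance from the action noise.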

masonjar-source commented 3 years ago

Thank you for the clarification! I also noticed that test_return is only calculated once every few steps, while train_return is calculated at every step. I'm not too familiar with these metrics, so: why was the choice made to have train rollouts be stochastic and test rollouts be deterministic? And why is test_return updated only every few steps instead of at every step? Is this standard practice for measuring offline RL?

xbpeng commented 3 years ago

By step, do you mean simulation step or update step? For these continuous control tasks, performance is usually higher with a deterministic policy, so it's common practice to evaluate with a deterministic policy. Training requires stochastic rollouts in order to estimate the policy gradients.
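The reason training needs stochastic rollouts is that the standard score-function (REINFORCE-style) gradient estimator averages over actions sampled from the policy; with a deterministic policy there is nothing to sample and the estimator is undefined. A toy 1-D sketch under assumed values (the Gaussian policy `N(mu, std^2)` and quadratic reward here are illustrative, not from the repo):

```python
import numpy as np

rng = np.random.default_rng(1)

# 1-D Gaussian policy pi(a) = N(mu, std^2); toy reward maximized at a = 1.
mu, std = 0.5, 0.3
def reward(a):
    return -(a - 1.0) ** 2

# Score-function estimate of d/dmu E[reward(a)]:
#   grad ≈ mean over sampled a of reward(a) * dlog pi(a)/dmu
#        = mean of reward(a) * (a - mu) / std^2
a = mu + std * rng.standard_normal(50_000)
grad_est = np.mean(reward(a) * (a - mu) / std**2)

# Analytically, E[reward] = -((mu - 1)^2 + std^2), so d/dmu = -2(mu - 1) = 1.0;
# the Monte Carlo estimate should land near that value.
```

The estimate's variance shrinks with the number of sampled actions, which is why training rollouts must actually sample from the policy rather than act deterministically.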