Closed dragon-wang closed 2 years ago
That confuses me too. However, the paper's author also uses it in his own implementation of CQL. Can anyone explain this? Thanks!
I think I could give some insight into this question.
CQL adds a conservative term, which is designed to minimize the Q-values of all valid actions. Given this motivation, the actions whose values are minimized should in principle be sampled from a uniform distribution. However, uniform sampling alone can be sample-inefficient.
\pi(a | s) and \pi(a' | s') tend to propose actions with high Q-values (either truly high values, or values inflated by OOD actions, i.e. the so-called overestimates). These actions are therefore the first ones to check.
I do not think the sampling strategy is important. I think \pi(a | s) should work, and so should \pi(a' | s'). However, I have no evidence to support this conjecture.
Yes, the sampling strategy is not important. After all, we only aim to approximate the logsumexp \log \sum_a \exp Q(s, a) using samples. \pi(a | s) and \pi(a' | s') are simply two distributions we have at hand. Using only the uniform distribution and \pi(a | s) is also fine.
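To make the point about proposals concrete, here is a minimal sketch of an importance-sampled logsumexp. It is not the paper's code or any repository's code. Everything in it is an assumption for illustration: a 1-D action range [-1, 1], a toy `q_value` for one fixed state, and truncated Gaussians standing in for \pi(a | s) and \pi(a' | s'). The only thing it is meant to show is that any proposal with full support and a known density gives roughly the same estimate, as long as each sample is weighted by its own log-density.

```python
import math
import numpy as np

LOW, HIGH = -1.0, 1.0  # assumed 1-D action range (hypothetical)


def q_value(a):
    # Toy Q(s, a) for one fixed state s; purely illustrative.
    return -4.0 * (a - 0.3) ** 2


def truncnorm_logpdf(a, mean, std):
    # Log-density of a Gaussian truncated to [LOW, HIGH].
    def cdf(x):
        return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

    log_norm = math.log(std * math.sqrt(2.0 * math.pi) * (cdf(HIGH) - cdf(LOW)))
    return -0.5 * ((a - mean) / std) ** 2 - log_norm


def truncnorm_sample(rng, mean, std, n):
    # Simple rejection sampler: keep Gaussian draws that fall in [LOW, HIGH].
    out = np.empty(0)
    while out.size < n:
        draw = rng.normal(mean, std, size=2 * n)
        out = np.concatenate([out, draw[(draw >= LOW) & (draw <= HIGH)]])
    return out[:n]


def is_logsumexp(q, log_p):
    # log( mean_i exp(Q(a_i) - log p(a_i)) ), computed stably.
    z = q - log_p
    m = np.max(z)
    return m + np.log(np.mean(np.exp(z - m)))


def demo(n=20_000, seed=0):
    rng = np.random.default_rng(seed)

    # Proposal 1: Unif(a).
    a_unif = rng.uniform(LOW, HIGH, size=n)
    logp_unif = np.full(n, -math.log(HIGH - LOW))

    # Proposal 2: stand-in for pi(a | s).
    a_pi = truncnorm_sample(rng, 0.3, 0.5, n)
    logp_pi = truncnorm_logpdf(a_pi, 0.3, 0.5)

    # Proposal 3: stand-in for pi(a' | s'); its actions are still scored with Q(s, .).
    a_next = truncnorm_sample(rng, -0.2, 0.5, n)
    logp_next = truncnorm_logpdf(a_next, -0.2, 0.5)

    # Reference value of log ∫ exp(Q(s, a)) da on a dense grid.
    grid = np.linspace(LOW, HIGH, 200_001)
    reference = math.log((HIGH - LOW) * np.mean(np.exp(q_value(grid))))

    # Pooling all samples, each with its own proposal's log-density.
    a_all = np.concatenate([a_unif, a_pi, a_next])
    logp_all = np.concatenate([logp_unif, logp_pi, logp_next])

    return {
        "reference": reference,
        "uniform": is_logsumexp(q_value(a_unif), logp_unif),
        "policy_s": is_logsumexp(q_value(a_pi), logp_pi),
        "policy_next": is_logsumexp(q_value(a_next), logp_next),
        "combined": is_logsumexp(q_value(a_all), logp_all),
    }


if __name__ == "__main__":
    for name, value in demo().items():
        print(f"{name:12s} {value:+.4f}")
```

In this toy setting all four estimates land close to the grid reference. The proposals here were chosen to have density bounded away from zero on the whole action range; a proposal that puts almost no mass where exp(Q) is large would give a much noisier estimate, which is presumably why mixing several proposals is attractive in practice.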
In Appendix F of the CQL paper, the importance-sampling estimate of the log-sum-exp of Q(s, a) only samples actions from Unif(a) and \pi(a | s). Why do we also need to sample actions from \pi(a' | s') here? This confuses me.