young-geng / CQL

Conservative Q Learning on top of SAC
MIT License

Some questions on CQL #6

Open dbsxdbsx opened 2 years ago

dbsxdbsx commented 2 years ago

1. For behavior cloning, the policy update is policy_loss = (alpha*log_pi - log_probs).mean(). Why does it use log_probs here rather than the Q-value?
2. When using the Lagrange version, do alpha_prime and cql_min_q_weight refer to the same thing? And shouldn't alpha_prime be updated before Q_loss is updated, according to formula 30 in the CQL paper?
3. Is the twin Q function still essential? In my opinion, since the learned Q-value is guaranteed to be a lower bound of the true Q-value, the twin Q function outputs are unnecessary. Am I right?
4. What is cql_temp in the code? Its value is always 1, so what would it be used for if it took a different value?
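
To make the cql_temp question concrete: a temperature like this usually enters through a scaled log-sum-exp over sampled Q-values. Below is a minimal sketch, assuming that is how cql_temp is used. The function name `scaled_logsumexp` and the sample values are hypothetical and not taken from the repository.

```python
import math

def scaled_logsumexp(q_values, temp=1.0):
    """Return temp * log(sum(exp(q / temp))), computed stably.

    Hypothetical helper: `temp` stands in for what cql_temp might control.
    """
    scaled = [q / temp for q in q_values]
    m = max(scaled)  # subtract the max before exponentiating to avoid overflow
    return temp * (m + math.log(sum(math.exp(s - m) for s in scaled)))

# Made-up Q-values for a handful of sampled actions at one state.
q_samples = [1.0, 2.0, 3.0]

soft_max_default = scaled_logsumexp(q_samples, temp=1.0)
soft_max_cold = scaled_logsumexp(q_samples, temp=0.01)
soft_max_hot = scaled_logsumexp(q_samples, temp=100.0)
```

Under this assumption, a small temperature makes the term approach the hard maximum of the sampled Q-values. A large temperature makes it approach their mean plus temp * log(N), so every sample contributes almost equally.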

(I know some of the code refers to the original CQL code, but since its author is no longer active, I am asking here.)