yihaosun1124 / OfflineRL-Kit

An elegant PyTorch offline reinforcement learning library for researchers.

Conservative Loss Calculations at combo.py Line 181 and cql.py Line 150 #12

Closed · aportekila closed this issue 8 months ago

aportekila commented 10 months ago

During conservative loss calculations, a potential error has been identified in the following code snippets:

  • combo.py at Line 181
  • cql.py at Line 150:

    next_obs_pi_value1, next_obs_pi_value2 = self.calc_pi_values(tmp_next_obss, tmp_obss)

Proposed correction:

    next_obs_pi_value1, next_obs_pi_value2 = self.calc_pi_values(tmp_next_obss, tmp_next_obss)

I might be wrong or misinterpreting this as well. Could someone confirm whether this change is necessary for a correct implementation?

kwanyoungpark commented 9 months ago

Actually, the first (original) implementation is correct: the original CQL implementation also samples actions from the next observations when computing the Q-values of the current observations.

When I used the second implementation, I got a lower result (95 -> 60 normalized score on hopper-medium-v2 with COMBO). This is probably because, on average, it is the same as 2 * self.calc_pi_values(tmp_obss, tmp_obss), so the Q-values of the policy are penalized more than intended.
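
For concreteness, here is a minimal sketch of how the three sampled Q-estimates typically enter the CQL(H) conservative term. This is not OfflineRL-Kit's exact code; the tensor shapes, temperature, and cql_weight below are assumptions:

    import torch

    def conservative_penalty(random_values, obs_pi_values, next_obs_pi_values,
                             data_q, cql_weight=5.0, temperature=1.0):
        # Each *_values tensor: (batch_size, num_repeat) importance-weighted
        # Q-estimates for one source of sampled actions; data_q: Q-values of the
        # dataset (state, action) pairs, shape (batch_size, 1).
        cat_q = torch.cat([random_values, obs_pi_values, next_obs_pi_values], dim=1)
        # logsumexp over the sampled actions approximates a soft maximum of
        # Q(s, .); subtracting the dataset Q pushes Q down off-distribution.
        penalty = temperature * torch.logsumexp(cat_q / temperature, dim=1).mean()
        return cql_weight * (penalty - data_q.mean())

If next_obs_pi_values is, on average, just a second copy of obs_pi_values (which is what calc_pi_values(tmp_next_obss, tmp_next_obss) amounts to, since s and s' come from nearly the same marginal state distribution), the logsumexp is dominated by the policy's own Q-values and the penalty on the policy is larger than CQL(H) intends, consistent with the score drop reported above.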

yihaosun1124 commented 9 months ago

> During conservative loss calculations, a potential error has been identified in the following code snippets:
>
>   • combo.py at Line 181
>   • cql.py at Line 150:
>
>     next_obs_pi_value1, next_obs_pi_value2 = self.calc_pi_values(tmp_next_obss, tmp_obss)
>
> Proposed correction:
>
>     next_obs_pi_value1, next_obs_pi_value2 = self.calc_pi_values(tmp_next_obss, tmp_next_obss)
>
> I might be wrong or misinterpreting this as well. Could someone confirm whether this change is necessary for a correct implementation?

Sorry for my late reply! As explained by GGOSinon, the original CQL implementation samples actions from the next observations when computing the Q-values of the current observations, so the first implementation is right. If you use the second implementation, it is equivalent on average to 2 * self.calc_pi_values(tmp_obss, tmp_obss).
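
To make the distinction concrete, the following is a hedged sketch of what a helper like calc_pi_values does, reconstructed only from the call sites discussed in this thread. The actor interface, the critic signature, and the importance-sampling correction are assumptions; the real body in cql.py / combo.py may differ:

    import torch

    def calc_pi_values_sketch(actor, critic1, critic2, obs_to_sample, obs_to_eval):
        # Sample actions (and their log-probs) from the policy at the FIRST argument.
        acts, log_probs = actor(obs_to_sample)  # assumed actor interface
        # Evaluate both Q-networks at the SECOND argument with those actions,
        # subtracting the log-probs as the CQL(H) importance-sampling correction.
        q1 = critic1(obs_to_eval, acts) - log_probs.detach()  # assumed Q(s, a) signature
        q2 = critic2(obs_to_eval, acts) - log_probs.detach()
        return q1, q2

    # calc_pi_values_sketch(actor, c1, c2, tmp_next_obss, tmp_obss):
    #     actions ~ pi(. | s'), Q evaluated at s  -> the term CQL intends.
    # calc_pi_values_sketch(actor, c1, c2, tmp_next_obss, tmp_next_obss):
    #     actions ~ pi(. | s'), Q evaluated at s' -> on average just another copy
    #     of the current-observation policy term, which over-penalizes the policy.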