Closed: aportekila closed this issue 8 months ago
Actually, the first (original) implementation is correct (the original CQL implementation also samples actions from the next observations for the Q-value of the current observation).
When I used the second implementation, I got a lower result (95 -> 60 normalized score on hopper-medium-v2 with COMBO). Maybe that is because it is the same as 2 * self.calc_pi_values(tmp_obss, tmp_obss) on average, so the policy's Q-value is penalized more than intended.
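To make the averaging argument explicit (a sketch, assuming the dataset marginals over $s$ and $s'$ are roughly the same, which holds approximately for long-trajectory D4RL-style datasets):

$$
\mathbb{E}_{s' \sim \mathcal{D}}\,\mathbb{E}_{a \sim \pi(\cdot \mid s')}\big[Q(s', a)\big] \;\approx\; \mathbb{E}_{s \sim \mathcal{D}}\,\mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[Q(s, a)\big],
$$

so replacing $\mathbb{E}_{a \sim \pi(\cdot \mid s')}[Q(s, a)]$ with $\mathbb{E}_{a \sim \pi(\cdot \mid s')}[Q(s', a)]$ makes the two policy terms identically distributed, and together they behave like $2\,\mathbb{E}_{a \sim \pi(\cdot \mid s)}[Q(s, a)]$ in expectation.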
During conservative loss calculations, a potential error has been identified in the following code snippets:
combo.py at Line 181 and cql.py at Line 150:

next_obs_pi_value1, next_obs_pi_value2 = self.calc_pi_values(tmp_next_obss, tmp_obss)
Proposed Correction:
next_obs_pi_value1, next_obs_pi_value2 = self.calc_pi_values(tmp_next_obss, tmp_next_obss)
I might be wrong or misinterpreting this, of course. Could someone confirm whether this change is necessary for a correct implementation?
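To make the argument convention concrete, here is a minimal, hypothetical sketch of the call pattern under discussion. The actor/critic classes, shapes, and the single-critic simplification are my own assumptions, not the repository's actual code; the only point is which observations feed the policy and which feed the Q-function.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the actor and a single critic; only the call pattern matters.
class GaussianActor(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.net = nn.Linear(obs_dim, 2 * act_dim)

    def forward(self, obs: torch.Tensor):
        mean, log_std = self.net(obs).chunk(2, dim=-1)
        dist = torch.distributions.Normal(mean, log_std.clamp(-5.0, 2.0).exp())
        act = dist.rsample()
        log_prob = dist.log_prob(act).sum(-1, keepdim=True)
        return act, log_prob


def calc_pi_values(actor, critic, obs_for_actions, obs_for_q):
    """Sample actions from pi(.|obs_for_actions) and evaluate Q at obs_for_q."""
    acts, log_probs = actor(obs_for_actions)
    return critic(torch.cat([obs_for_q, acts], dim=-1)) - log_probs


obs_dim, act_dim, batch = 11, 3, 256
actor, critic = GaussianActor(obs_dim, act_dim), nn.Linear(obs_dim + act_dim, 1)
tmp_obss = torch.randn(batch, obs_dim)
tmp_next_obss = torch.randn(batch, obs_dim)

# Current code: actions sampled from the NEXT observations, Q evaluated at the CURRENT ones.
next_obs_pi_values = calc_pi_values(actor, critic, tmp_next_obss, tmp_obss)       # Q(s, a), a ~ pi(.|s')

# Proposed correction: both sampling and evaluation use the next observations.
proposed_pi_values = calc_pi_values(actor, critic, tmp_next_obss, tmp_next_obss)  # Q(s', a), a ~ pi(.|s')
```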
Sorry for my late reply! As GGOSinon explained, the original CQL implementation samples actions from the next observations for the Q-value of the current observation, so the first implementation is right. If you use the second implementation, it is equivalent to 2 * self.calc_pi_values(tmp_obss, tmp_obss) on average.
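For reference, my reading of the original CQL code is that the conservative term is a multiple-importance-sampled estimate of the log-sum-exp over actions: proposal actions are drawn from a uniform distribution, from $\pi(\cdot \mid s)$, and from $\pi(\cdot \mid s')$, but $Q$ is always evaluated at the current state $s$. Roughly (temperature, scaling constants, and the symbol $\hat{\mathcal{L}}_{\text{cons}}$ are my own simplification):

$$
\hat{\mathcal{L}}_{\text{cons}}(s) \;=\; \log \sum_{i=1}^{N} \left[ \frac{\exp Q(s, a_i^{u})}{\mathrm{Unif}(a_i^{u})} + \frac{\exp Q(s, a_i^{\pi})}{\pi(a_i^{\pi} \mid s)} + \frac{\exp Q(s, a_i^{\pi'})}{\pi(a_i^{\pi'} \mid s')} \right] \;-\; \mathbb{E}_{a \sim \mathcal{D}}\big[Q(s, a)\big],
$$

with $a_i^{u} \sim \mathrm{Unif}$, $a_i^{\pi} \sim \pi(\cdot \mid s)$, and $a_i^{\pi'} \sim \pi(\cdot \mid s')$. The point relevant here is that every $Q$ in the first term takes $s$, not $s'$, as its state argument.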