Closed: aportekila closed this issue 8 months ago
Actually, the first (original) implementation is correct (the original CQL implementation also samples actions from the next observations for the Q-value of the current observation).
When I used the second implementation, I got a lower result (95 -> 60 normalized score on hopper-medium-v2 with COMBO). Maybe that is because it is the same as 2 * self.calc_pi_values(tmp_obss, tmp_obss) on average, so the policy's Q-value is penalized more than intended.
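To make the averaging argument explicit (a sketch, assuming the dataset marginals over $s$ and $s'$ are roughly the same, which holds approximately for long-trajectory D4RL-style datasets):

$$
\mathbb{E}_{s' \sim \mathcal{D}}\,\mathbb{E}_{a \sim \pi(\cdot \mid s')}\big[Q(s', a)\big] \;\approx\; \mathbb{E}_{s \sim \mathcal{D}}\,\mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[Q(s, a)\big],
$$

so replacing $\mathbb{E}_{a \sim \pi(\cdot \mid s')}[Q(s, a)]$ with $\mathbb{E}_{a \sim \pi(\cdot \mid s')}[Q(s', a)]$ makes the two policy terms identically distributed, and together they behave like $2\,\mathbb{E}_{a \sim \pi(\cdot \mid s)}[Q(s, a)]$ in expectation.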
During conservative loss calculations, a potential error has been identified in the following code snippets:
combo.py at Line 181 and cql.py at Line 150:

next_obs_pi_value1, next_obs_pi_value2 = self.calc_pi_values(tmp_next_obss, tmp_obss)
Proposed Correction:
next_obs_pi_value1, next_obs_pi_value2 = self.calc_pi_values(tmp_next_obss, tmp_next_obss)
I might be wrong or misinterpreting this, of course. Could someone confirm whether this change is necessary for a correct implementation?
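To make the argument convention concrete, here is a minimal, hypothetical sketch of the call pattern under discussion. The actor/critic classes, shapes, and the single-critic simplification are my own assumptions, not the repository's actual code; the only point is which observations feed the policy and which feed the Q-function.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the actor and a single critic; only the call pattern matters.
class GaussianActor(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.net = nn.Linear(obs_dim, 2 * act_dim)

    def forward(self, obs: torch.Tensor):
        mean, log_std = self.net(obs).chunk(2, dim=-1)
        dist = torch.distributions.Normal(mean, log_std.clamp(-5.0, 2.0).exp())
        act = dist.rsample()
        log_prob = dist.log_prob(act).sum(-1, keepdim=True)
        return act, log_prob


def calc_pi_values(actor, critic, obs_for_actions, obs_for_q):
    """Sample actions from pi(.|obs_for_actions) and evaluate Q at obs_for_q."""
    acts, log_probs = actor(obs_for_actions)
    return critic(torch.cat([obs_for_q, acts], dim=-1)) - log_probs


obs_dim, act_dim, batch = 11, 3, 256
actor, critic = GaussianActor(obs_dim, act_dim), nn.Linear(obs_dim + act_dim, 1)
tmp_obss = torch.randn(batch, obs_dim)
tmp_next_obss = torch.randn(batch, obs_dim)

# Current code: actions sampled from the NEXT observations, Q evaluated at the CURRENT ones.
next_obs_pi_values = calc_pi_values(actor, critic, tmp_next_obss, tmp_obss)       # Q(s, a), a ~ pi(.|s')

# Proposed correction: both sampling and evaluation use the next observations.
proposed_pi_values = calc_pi_values(actor, critic, tmp_next_obss, tmp_next_obss)  # Q(s', a), a ~ pi(.|s')
```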
Sorry for my late reply! As GGOSinon explained, the original CQL implementation samples actions from the next observations for the Q-value of the current observation, so the first implementation is right. If you use the second implementation, it is equivalent to 2 * self.calc_pi_values(tmp_obss, tmp_obss) on average.
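For reference, my reading of the original CQL code is that the conservative term is a multiple-importance-sampled estimate of the log-sum-exp over actions: proposal actions are drawn from a uniform distribution, from $\pi(\cdot \mid s)$, and from $\pi(\cdot \mid s')$, but $Q$ is always evaluated at the current state $s$. Roughly (temperature, scaling constants, and the symbol $\hat{\mathcal{L}}_{\text{cons}}$ are my own simplification):

$$
\hat{\mathcal{L}}_{\text{cons}}(s) \;=\; \log \sum_{i=1}^{N} \left[ \frac{\exp Q(s, a_i^{u})}{\mathrm{Unif}(a_i^{u})} + \frac{\exp Q(s, a_i^{\pi})}{\pi(a_i^{\pi} \mid s)} + \frac{\exp Q(s, a_i^{\pi'})}{\pi(a_i^{\pi'} \mid s')} \right] \;-\; \mathbb{E}_{a \sim \mathcal{D}}\big[Q(s, a)\big],
$$

with $a_i^{u} \sim \mathrm{Unif}$, $a_i^{\pi} \sim \pi(\cdot \mid s)$, and $a_i^{\pi'} \sim \pi(\cdot \mid s')$. The point relevant here is that every $Q$ in the first term takes $s$, not $s'$, as its state argument.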