x-tu / GGF-wcMDP


Fix Q learning #19

Closed x-tu closed 1 year ago

x-tu commented 1 year ago

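The previous version optimizes the expected GGF of rewards, E[GGF(R)], instead of the GGF of expected rewards, GGF(E[R]), which is our fair optimization objective. Assuming the usual non-increasing GGF weights, GGF is concave, so by Jensen's inequality E[GGF(R)] ≤ GGF(E[R]). The previous version is therefore optimizing only a lower bound on the optimal values.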
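To make the gap between the two objectives concrete, here is a minimal sketch in plain Python. The `ggf` helper, the weights, and the two-user reward outcomes are illustrative assumptions, not code from this repository; the helper assumes non-increasing weights applied to the rewards sorted in ascending order.

```python
def ggf(values, weights):
    """Generalized Gini function (hypothetical helper).

    Assumes `weights` are non-increasing, so the largest weight
    is applied to the smallest (worst-off) value.
    """
    return sum(w * v for w, v in zip(weights, sorted(values)))


# Illustrative setup: two users, two equally likely reward outcomes.
weights = [0.75, 0.25]
outcomes = [[1.0, 0.0], [0.0, 1.0]]
probs = [0.5, 0.5]

# Objective optimized by the previous version: E[GGF(R)].
expected_ggf = sum(p * ggf(r, weights) for p, r in zip(probs, outcomes))

# Intended fair objective: GGF(E[R]).
expected_reward = [
    sum(p * r[i] for p, r in zip(probs, outcomes)) for i in range(len(weights))
]
ggf_of_expected = ggf(expected_reward, weights)

print(expected_ggf, ggf_of_expected)  # prints "0.25 0.5"
```

In this toy case each individual outcome is unfair, so its GGF is low, while the expected rewards are perfectly balanced across users; the expected GGF (0.25) falls strictly below the GGF of the expected rewards (0.5).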

[TO CHECK] Here is the proposed initial version of GGF Q-Learning:

[Screenshot (2023-08-18): proposed GGF Q-Learning algorithm; image not reproduced here]