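The previous version optimizes the expected GGF of the rewards rather than the GGF of the expected rewards, which is our fair optimization objective. Because the GGF with non-increasing weights is concave, Jensen's inequality gives E[GGF(R)] ≤ GGF(E[R]), so the previous version optimizes a lower bound on the objective rather than the objective itself.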
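The gap between the two quantities can be checked on a toy case. The sketch below is only illustrative: it assumes the usual definition of the GGF as a weighted sum of the components sorted in ascending order with non-increasing weights, and the two-objective reward vectors, probabilities, and weights are hypothetical values chosen for the example, not taken from our experiments.

```python
def ggf(values, weights):
    """Generalized Gini function (assumed definition): apply non-increasing
    weights to the components sorted in ascending order, so the worst-off
    objective receives the largest weight."""
    return sum(w * v for w, v in zip(weights, sorted(values)))


# Hypothetical setup: two objectives, two equally likely reward vectors.
weights = [0.75, 0.25]               # non-increasing weights (assumption)
outcomes = [[1.0, 0.0], [0.0, 1.0]]  # each outcome favors one objective
probs = [0.5, 0.5]

# GGF of expected rewards: average the vectors first, then apply the GGF.
expected_reward = [
    sum(p * r[i] for p, r in zip(probs, outcomes)) for i in range(2)
]
ggf_of_expectation = ggf(expected_reward, weights)

# Expected GGF rewards: apply the GGF to each outcome, then average.
expectation_of_ggf = sum(p * ggf(r, weights) for p, r in zip(probs, outcomes))

print(ggf_of_expectation)  # 0.5
print(expectation_of_ggf)  # 0.25
```

In this example each individual outcome is maximally unfair, so its GGF is low, while the averaged reward vector is perfectly balanced. Optimizing the expected GGF therefore undervalues a policy that is fair in expectation.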
[TO CHECK] Here is the proposed initial version of GGF Q-Learning: