Fix Q learning

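The previous version optimizes the expected GGF of rewards, E[GGF(R)], instead of the GGF of expected rewards, GGF(E[R]), which is our fair optimization objective. Because GGF is concave (non-increasing weights applied to rewards sorted in ascending order), Jensen's inequality gives E[GGF(R)] ≤ GGF(E[R]), so the previous version was only optimizing a lower bound on the objective.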
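To make the gap between the two quantities concrete, here is a minimal numeric sketch. It is not code from this repository: the `ggf` helper, the weights, and the two sample reward vectors are all hypothetical, chosen only to illustrate the inequality.

```python
def ggf(values, weights):
    """Generalized Gini function (illustrative helper, not the repo's implementation).

    Applies non-increasing weights to the values sorted in ascending order,
    so the worst-off objective receives the largest weight.
    """
    return sum(w * v for w, v in zip(weights, sorted(values)))


# Hypothetical setup: two objectives, two equally likely reward vectors.
weights = [0.75, 0.25]
samples = [[1.0, 0.0], [0.0, 1.0]]

# Expected GGF: apply GGF to each sample, then average.
expected_ggf = sum(ggf(r, weights) for r in samples) / len(samples)

# GGF of expected rewards: average each objective first, then apply GGF.
mean_reward = [sum(col) / len(samples) for col in zip(*samples)]
ggf_of_expected = ggf(mean_reward, weights)

print(expected_ggf, ggf_of_expected)  # prints: 0.25 0.5
```

In this toy case each sample is maximally unfair to one objective, while the averaged rewards are perfectly balanced, so scoring per sample understates the fairness of the expected outcome.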

[TO CHECK] Here is the proposed initial version of GGF Q-Learning:

x-tu / GGF-wcMDP

Fix Q learning #19