[Open] artofbeinghuman opened this issue 5 years ago
Hi @artofbeinghuman
This is possible, and in fact it is what happened in the DoomRNN experiment; see the section called “Cheating the World Model” (https://worldmodels.github.io/). As you mentioned, C becomes adapted to the z_t and h_t of this new scheme: C learned to exploit an imperfect M, finding a policy that triggered M to predict z_{t+1} values that are not realistic but that allowed it to survive for a long time in the dream environment. And as you also mention, such a policy won't fare well when deployed back into the real environment.
In the paper, I described a method where, if we increase the temperature of the dream environment and make the prediction of z_{t+1} noisier, it helps align M back closer to reality, and hence allows C to learn more transferable policies. This probably works only in a subset of real environments and is worth a closer look. If you are looking to work in this space, I think an exciting research direction is to characterize the conditions under which policies are transferable.
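The temperature idea above can be sketched concretely. In the World Models setup, M is a mixture-density RNN, and the temperature scales both the mixture logits and the Gaussian widths when sampling z_{t+1} in the dream. The function below is a minimal illustrative sketch, not the repo's actual API; the names, shapes, and the exact scaling are assumptions.

```python
import numpy as np

def sample_z_next(logmix, mu, logstd, temperature=1.0, rng=None):
    """Sample z_{t+1} from a mixture-of-Gaussians head, with temperature.

    Higher temperature flattens the mixture weights and widens each
    Gaussian, making the dream rollout noisier and harder for C to exploit.
    Shapes (assumed): logmix, mu, logstd are (z_dim, n_mixtures).
    Hypothetical sketch; not the World Models repo's actual function.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Temperature-scale the mixture logits, then softmax over mixtures.
    logits = logmix / temperature
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    z = np.empty(mu.shape[0])
    for i in range(mu.shape[0]):  # one independent mixture per latent dim
        k = rng.choice(mu.shape[-1], p=w[i])
        # Widen the chosen Gaussian by sqrt(temperature) as well.
        z[i] = (mu[i, k]
                + np.exp(logstd[i, k]) * np.sqrt(temperature)
                * rng.standard_normal())
    return z
```

With temperature > 1, C can rely less on M's exact (possibly exploitable) predictions, which is the alignment effect described above.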
Best.
Hello, I have a question regarding learning in the "dream environment", i.e. training the controller (C) and the RNN (M) while excluding the VAE (V), by feeding M its own predictions z_{t+1}:
When learning inside the latent space of M, is it possible that there is an "encoding shift" of the latent encoding z_t of the game state? That is, through imprecise prediction of z_{t+1}, the original "encoding scheme" of z_t defined by V is no longer upheld, and M effectively uses a similar but different coding scheme for its internal representation of the game state z_t (resp. the visual state of the game). In that case, features learned by M and C would not transfer from the dream environment to the real game, because M expects the new encoding scheme as its input z_t, and likewise its hidden state h_t is based on a history of events encoded in the new scheme. C, in turn, is adapted to the z_t and h_t of this new scheme and won't fare well with the z_t coming from V and the resulting h_t, once we try to transfer C's learned strategies and M's prediction capabilities back into the real world (observing the actual game through V's encoding).
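The compounding-error mechanism behind this worry can be shown with a toy model. Below, a stand-in for the true latent dynamics is compared against a slightly biased learned model that, as in the dream setting, consumes its own predictions. All names and the linear dynamics are purely illustrative assumptions, not the World Models code.

```python
import numpy as np

def true_step(z):
    # Stand-in for the real dynamics in V's latent space.
    return 0.9 * z

def m_step(z, bias=0.05):
    # Stand-in for a slightly imperfect learned model M:
    # same contraction, plus a small systematic error.
    return 0.9 * z + bias

z_real = np.ones(4)
z_dream = np.ones(4)
for t in range(50):
    z_real = true_step(z_real)    # trajectory under the real dynamics
    z_dream = m_step(z_dream)     # closed loop: M feeds on its own output

# The real trajectory decays toward 0, while the closed-loop dream
# trajectory drifts toward M's own fixed point instead.
drift = np.abs(z_dream - z_real).max()
```

A per-step error that would be negligible in one-step prediction accumulates over the closed-loop rollout, which is one way the dream's effective "encoding scheme" can move away from what V produces.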
I hope the issue is somewhat clear.
Thank you, M. Baumann