worldmodels / worldmodels.github.io

World Models
Creative Commons Attribution 4.0 International

Question regarding RNN-MDN #6

Open vipinpillai opened 6 years ago

vipinpillai commented 6 years ago

Thanks for sharing the demo. I had some questions regarding the RNN-MDN module used to sample z_{t+1}.

Since the latent sample is 32-dimensional for the car racing experiment, does the MDN model each distribution within the mixture as a Multivariate Normal distribution?

Could you please share some details on the explicit separation between the RNN and MDN modules, and on the tweaks needed to stabilize MDN training and keep the NLL loss calculation tractable for the multivariate MDN?

hardmaru commented 6 years ago

Hi @vipinpillai

Please also refer to the RNN Description in Appendix:

For the M Model, we use an LSTM recurrent neural network combined with a Mixture Density Network as the output layer. We use this network to model the probability distribution of the next z in the next time step as a Mixture of Gaussian distribution (NB: This answers @vipinpillai's 1st question). This approach is very similar to Graves’ Generating Sequences with RNNs in the Unconditional Handwriting Generation section and also the decoder-only section of SketchRNN. The only difference in the approach used is that we did not model the correlation parameter between each element of z, and instead had the MDN-RNN output a diagonal covariance matrix of a factored Gaussian distribution. (This answers part of @vipinpillai's 2nd question regarding making training stable and loss calculation tractable)

Unlike the handwriting and sketch generation works, rather than using the MDN-RNN to model the pdf of the next pen stroke, we model instead the pdf of the next latent vector z. The MDN-RNNs were trained for 20 epochs on the data collected from a random policy agent. In the Car Racing task, the LSTM used 256 hidden units, while the Doom task used 512 hidden units. In both tasks, we used 5 Gaussian mixtures and did not model the correlation ρ parameter, hence z is sampled from a factored mixture of Gaussian distribution. (@vipinpillai: please keep this sampling in mind, since it helps safeguard against overfitting to a stored set of z's)

When training the MDN-RNN using teacher forcing from the recorded data, we store a pre-computed set of μ and σ for each of the frames, and sample an input z ∼N(μ,σ) each time we construct a training batch, to prevent overfitting our MDN-RNN to a specific sampled z.
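
The resampling step described above can be sketched roughly as follows. This is only an illustration, not the code used for the article: the function name `sample_z_batch`, the array shapes, and the choice to store σ as a standard deviation (rather than a log-variance) are all assumptions.

```python
import numpy as np

def sample_z_batch(mu, sigma, rng):
    """Draw a fresh z ~ N(mu, sigma^2) for every stored frame.

    mu, sigma: arrays of shape (batch, seq_len, z_dim) holding the
    pre-computed VAE outputs. Assumption: sigma is a standard deviation.
    """
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

rng = np.random.default_rng(0)
mu = np.zeros((4, 10, 32))          # hypothetical stored means
sigma = np.full((4, 10, 32), 0.1)   # hypothetical stored standard deviations
z_a = sample_z_batch(mu, sigma, rng)
z_b = sample_z_batch(mu, sigma, rng)  # same frames, different sample
```

Because a new z is drawn every time a batch is built, the MDN-RNN never sees exactly the same input sequence twice, even though the recorded frames are fixed.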

I put some pointers to your questions referencing the article. Please let me know if you have any more questions!

vipinpillai commented 6 years ago

Thanks a lot @hardmaru for providing detailed pointers.

One last implementation specific question is whether you have considered each dimension of latent z as a mixture of Univariate Gaussians or the entire z as a mixture of Multivariate Gaussians while training the MDN-RNN. I am asking this due to the numerical issues of computing the NLL for GMM where each Gaussian is 32 dimensional. In the approach used by Graves’ Generating Sequences with RNNs, we have a mixture of bi-variate gaussians, and hence it is less prone to numerical instability during training.

hardmaru commented 6 years ago

Hi @vipinpillai,

z is modelled as a factored Gaussian, so the correlation ρ parameter is assumed to be zero, unlike Graves (2013), which modelled a correlation term. This decreases the model complexity, and also makes the training much more numerically stable, as you can operate in log-space and avoid having to take the logarithm of exponentials for your computational graph.
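
To make the log-space point concrete, here is a minimal NumPy sketch of a negative log-likelihood for a mixture of diagonal-covariance Gaussians. It is an assumption-laden illustration: it takes the "entire z as a mixture of multivariate Gaussians" reading, which the thread does not confirm was the one used, and the names `mdn_nll`, `logit_pi`, and `log_sigma` are made up for this example. With ρ fixed at zero, each component's log-density is just a sum over dimensions, and the only exponentials left are inside a max-shifted log-sum-exp over the mixture components.

```python
import numpy as np

LOG_SQRT_2PI = 0.5 * np.log(2.0 * np.pi)

def logsumexp(a, axis):
    """Numerically stable log(sum(exp(a))) along one axis."""
    a_max = np.max(a, axis=axis, keepdims=True)
    out = a_max + np.log(np.sum(np.exp(a - a_max), axis=axis, keepdims=True))
    return np.squeeze(out, axis=axis)

def mdn_nll(z, logit_pi, mu, log_sigma):
    """Mean negative log-likelihood of z under a mixture of diagonal Gaussians.

    z:         (batch, z_dim)
    logit_pi:  (batch, n_mix)          unnormalized mixture weights
    mu:        (batch, n_mix, z_dim)
    log_sigma: (batch, n_mix, z_dim)   log standard deviations
    """
    # Normalize the mixture weights in log-space.
    log_pi = logit_pi - logsumexp(logit_pi, axis=1)[:, None]
    diff = (z[:, None, :] - mu) / np.exp(log_sigma)
    # Diagonal covariance: the component log-density is a sum over dimensions.
    log_comp = np.sum(-0.5 * diff**2 - log_sigma - LOG_SQRT_2PI, axis=2)
    return -np.mean(logsumexp(log_pi + log_comp, axis=1))
```

Treating each dimension of z as its own univariate mixture would instead apply the log-sum-exp per dimension and then sum over dimensions; only the order of the two reductions changes.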

vipinpillai commented 6 years ago

Thanks @hardmaru for the quick response. I understand that you have not included the correlation parameter. However, I'm still trying to understand the answer to the following question: Have you considered each dimension of latent z as a mixture of Univariate Gaussians or the entire z as a mixture of Multivariate Gaussians?

hardmaru commented 6 years ago

No, I have not tried that. Feel free to try it out yourself if you think it is interesting.

dariocazzani commented 6 years ago

Hi, I'd like to jump in the discussion. I have 2 more questions:

  1. Can you share what sequence length you used during training? I am assuming 300 (based on your work on generating handwriting, even though the problem is different)
  2. tau is set to 1.0 at training time, right?

Thanks in advance.

hardmaru commented 6 years ago

Hi @dariocazzani,

I used sequence lengths of 1000 timesteps. These long lengths are possible since we don't need to train the VAE at the same time as the MDN-RNN, enabling us to save lots of GPU memory and learn all of the long-term dependencies. During training, tau is 1.0.

dariocazzani commented 6 years ago

Thanks @hardmaru for the answers. From your blog you say:

We have first an agent acting randomly to explore the environment multiple times, and record the random actions a_t taken and the resulting observations from the environment.

With random actions and such long sequences, in Car Racing the car ends up spending most of the time in the grass. The MDN-RNN will learn very well that once in grass, no matter what you do, you stay in the grass. (Not really, but practically with random actions and no intention to recover, this is what happens)

This seems like a waste, or am I missing something?

hardmaru commented 6 years ago

hi @dariocazzani,

Good question! I also had to think about this in the experiments. Another person who reproduced the CarRacing task used a method to encourage more diversity:

https://medium.com/applied-data-science/how-to-build-your-own-world-model-using-python-and-keras-64fb388ba459

Have you thought about potential approaches that might help overcome this issue? I'd be curious to know what other people come up with! Feel free to list some here.

I'll let you know what I did to generate the random actions to encourage a more diverse set, and I think my approach is more elegant than simply hitting the accelerate pedal :)

dariocazzani commented 6 years ago

The person who wrote that blog post did a few things:

This works well if the car starts at the same spot and if we limit the rollouts at 300.

What I did was first of all to make the car start at random points for each rollout:

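import numpy as np
from gym.envs.box2d.car_dynamics import Car
from gym.envs.box2d import CarRacing
[...]
# Pick a random track point and place a new car there,
# using that point's angle and (x, y) position.
position = np.random.randint(len(env.track))
env.car = Car(env.world, *env.track[position][1:4])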

I made a PR to OpenAI gym to make this possible without "tricks" Link to PR

And this is my policy for generating random actions. Notice that I cannot count on a straight road at the beginning.

def generate_action(prev_action):
    # Repeat the previous action about two thirds of the time.
    if np.random.randint(3) % 3:
        return prev_action

    index = np.random.randn(3)
    # Favor acceleration over the others:
    index[1] = np.abs(index[1])
    index = np.argmax(index)
    mask = np.zeros(3)
    mask[index] = 1

    action = np.random.randn(3)
    action = np.tanh(action)
    # Rescale gas and brake from [-1, 1] to [0, 1].
    action[1] = (action[1] + 1) / 2
    action[2] = (action[2] + 1) / 2

    # Only one of steering, gas, or brake is active per action.
    return action*mask

When running the Controller's prediction, I assume that the car never has to brake and accelerate at the same time, so I could reduce the number of action outputs to 2. This cut the number of Controller parameters by ~33%.

[...]
    action[0] = prediction[0]
    if prediction[1] < 0:
        action[1] = np.abs(prediction[1])
        action[2] = 0
    else:
        action[2] = prediction[1]
        action[1] = 0
    return action

I am documenting everything in a series of blog posts (still WIP, since I need to work on it in my spare time). The first episode is World Models in TensorFlow — Episode 1.0 — OpenAi Gym Race Car. Counting on releasing the work on Doom soon :)

Suggestions for improvements are always welcome :)

Thanks again for the feedback @hardmaru

hardmaru commented 6 years ago

Hi @dariocazzani

That's a nice strategy of starting at a random place on the track. However, I would be careful not to change the "official" carracing-v0 to have this setting, since it should be kept the same way for evaluation purposes against published results using this environment. Perhaps making a fork, a separate environment (for the purpose of training an agent only, not for evaluation), might make more sense? You would still need to evaluate on the original carracing-v0 to compare results with previous published methods.

Reducing the action space to 2 is also clever. I did a similar thing for the DoomTakeCover scenario and reduced the action space to 1 real value action.

To generate random episodes that are more diverse, rather than using a simple random policy that samples the action space uniformly, I initialized V, M, and C with random weights sampled from a normal distribution with zero mean and a small standard deviation. The random agent then drives around the track using this randomized policy, which is more diverse and also represents a natural prior over what the agent can do, since in the end the agent has to learn a set of parameters for V, M, and C from this sampled parameter space anyway.
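
As a rough illustration of the random-weights idea, the sketch below samples the parameters of a linear controller from a zero-mean normal distribution and maps its output to a CarRacing action. Everything specific here is an assumption made for the example: only C is shown (V and M would be initialized the same way), the names `init_random_params` and `controller_action` are hypothetical, `std=0.1` is an arbitrary "small" value, and the tanh squashing of the outputs is one plausible choice rather than the setup described above.

```python
import numpy as np

def init_random_params(n_params, std, rng):
    """Sample a flat parameter vector from N(0, std^2)."""
    return rng.standard_normal(n_params) * std

def controller_action(params, z, h):
    """Linear controller a = W [z, h] + b, squashed into CarRacing's action ranges."""
    x = np.concatenate([z, h])
    n_in = x.shape[0]
    W = params[: 3 * n_in].reshape(3, n_in)
    b = params[3 * n_in :]
    raw = np.tanh(W @ x + b)
    # Steering stays in [-1, 1]; gas and brake are rescaled to [0, 1].
    return np.array([raw[0], (raw[1] + 1) / 2, (raw[2] + 1) / 2])

rng = np.random.default_rng(0)
z_dim, h_dim = 32, 256  # latent size and LSTM hidden size for Car Racing
params = init_random_params(3 * (z_dim + h_dim) + 3, std=0.1, rng=rng)
action = controller_action(params, rng.standard_normal(z_dim), np.zeros(h_dim))
```

Each rollout would draw a new parameter vector, so the collected episodes come from many different, internally consistent policies rather than from independent noise at every step.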

Good luck!

dariocazzani commented 6 years ago

Hey @hardmaru, I'll respond point per point.

An extra bit: a variation on using MDNs for M. I was thinking of experimenting with the RNN outputting 2 vectors, μ_RNN and σ_RNN, used to sample the predicted frame as z_RNN ∼ N(μ_RNN, σ_RNN). The loss function would be the KL divergence between N(μ_RNN, σ_RNN) and N(μ_VAE, σ_VAE).

I would compute the KL divergence in closed form. The Monte Carlo method might not be suitable for small vectors like these ones. What do you think? Have you tried it yet? I'll see how it goes.
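
For reference, the closed-form KL divergence between two diagonal Gaussians can be sketched as below. The function name `kl_diag_gaussians` is hypothetical, and the sketch assumes the σ vectors are standard deviations. KL is asymmetric, so which distribution goes first (RNN or VAE) is a design choice this example leaves to the caller.

```python
import numpy as np

def kl_diag_gaussians(mu_p, sigma_p, mu_q, sigma_q):
    """KL( N(mu_p, diag(sigma_p^2)) || N(mu_q, diag(sigma_q^2)) ), summed over dimensions."""
    var_p = sigma_p ** 2
    var_q = sigma_q ** 2
    per_dim = (
        np.log(sigma_q / sigma_p)
        + (var_p + (mu_p - mu_q) ** 2) / (2.0 * var_q)
        - 0.5
    )
    return np.sum(per_dim, axis=-1)

# Two 32-dimensional unit-variance Gaussians whose means differ by 1 in every dimension.
kl = kl_diag_gaussians(np.zeros(32), np.ones(32), np.ones(32), np.ones(32))
```

In the example, each dimension contributes 0.5, so `kl` is 16 in total; identical distributions give exactly 0.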