
Question regarding the mixture in MDN (pi) #11

Open · Caselles opened this issue 6 years ago

Caselles commented 6 years ago

I wonder how the mixture is handled in the MDN at test time (i.e., when we want to dream).

http://blog.otoro.net/2015/11/24/mixture-density-networks-with-tensorflow/ : Here we have an example of an MDN with k Gaussians of parameters (mu_k, sigma_k) that are 1-dimensional, i.e., each mu_k and sigma_k is a scalar. Since we can only mix these k Gaussians by picking one of the k possible pairs of parameters (mu_k, sigma_k), the shape of Pi is Pi = (Pi_1, ..., Pi_k), with each Pi_j being the (scalar) probability of selecting the j-th Gaussian.
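Concretely, sampling from that 1-D MDN looks like this (a minimal NumPy sketch with made-up parameters):

```python
import numpy as np

# k = 3 one-dimensional Gaussians with (made-up) mixture weights.
pi    = np.array([0.2, 0.5, 0.3])   # shape (k,), sums to 1
mu    = np.array([-1.0, 0.0, 2.0])  # scalar means
sigma = np.array([0.5, 1.0, 0.3])   # scalar stddevs

j = np.random.choice(len(pi), p=pi)     # pick one of the k Gaussians
y = np.random.normal(mu[j], sigma[j])   # sample from the chosen Gaussian
```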

In the case of the MDN in World Models, the k Gaussians of parameters (mu_k, sigma_k) are n-dimensional (with the same dimension as z, the encoded feature coming from the VAE). Hence, given one sample, there is a choice for how to do the mixture. You can either:

1) keep the same shape of Pi as before, Pi = (Pi_1, ..., Pi_k), with each Pi_j being the (scalar) probability of selecting the j-th Gaussian; or

2) make Pi a matrix of size (k, n), where n is the dimension of z (the encoding from the VAE): Pi = (Pi_1, ..., Pi_n), where for each j, Pi_j = (Pi_j_1, ..., Pi_j_k) and Pi_j_l is the probability of selecting, for dimension j, the j-th component of the l-th Gaussian (mu_l, sigma_l). So, for each j in 1..n, you sample according to the distribution Pi_j (this is where you can modify the distribution with temperature to get a more or less stochastic world model, if I understand right) to select the j-th component of one of the k Gaussians. The result is a new mu and a new sigma, both n-dimensional mixes of the components of the mu_k's and sigma_k's. Finally, you sample from this new Gaussian to obtain the next z.

For 1) there are k (n-dimensional) Gaussians to choose from, while for 2) there are n*k univariate Gaussian components to choose from (k per dimension), i.e., k^n possible combinations for the full vector.

Which one is right, and why?

In available implementations, I see that people seem to use 2) (since the shape of Pi is (NB_GAUSSIANS, SIZE_FEATURE_VAE)).

[I also asked this question in the comments on Reddit: https://www.reddit.com/r/MachineLearning/comments/8poc3z/r_blog_post_on_world_models_for_sonic/ . If I get an answer there, I will close this issue and reference the answer.]

hardmaru commented 6 years ago

Hi @Caselles

During inference (testing), I first sample one of the k possible mixtures, from a categorical distribution defined by the πᵢ's.

After picking one of the mixtures, I then sample a Gaussian z vector using the μ and σ vectors for that mixture. I have also done the same thing in the sketch-rnn demo.
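Roughly, in NumPy (an illustrative sketch with assumed shapes, not the actual implementation):

```python
import numpy as np

k, n = 5, 64                      # assumed: k mixtures, n-dimensional z
pi    = np.full(k, 1.0 / k)       # mixture weights, shape (k,)
mu    = np.random.randn(k, n)     # means,   one n-dim vector per mixture
sigma = np.ones((k, n))           # stddevs, one n-dim vector per mixture

j = np.random.choice(k, p=pi)                # 1) pick one mixture
z = mu[j] + sigma[j] * np.random.randn(n)    # 2) sample the full z vector
```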

Caselles commented 6 years ago

I understand, but this is precisely what raises my concern. Let me explain:

Let NB_GAUSSIAN be the number of Gaussians in the MDN. Let Z_SIZE be the dimension of the latent vector produced by the VAE.

This means that, in your implementation, the shape of π is (NB_GAUSSIAN,). π is a distribution from which you can sample an integer ranging from 1 to NB_GAUSSIAN. Once you obtain this integer, call it j, you sample from the j-th Gaussian to obtain the output.

In various implementations, such as this one in PyTorch and this one in Keras, the shape of π is (NB_GAUSSIAN, Z_SIZE). It consists of Z_SIZE distributions over {1, ..., NB_GAUSSIAN}, one per dimension of z, each of which tells you which of the NB_GAUSSIAN Gaussians to sample that component from.

Here is a simple example:

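Sketched in NumPy (hypothetical shapes and parameters, NB_GAUSSIAN = 3 and Z_SIZE = 4; the temperature handling is just one common way to apply it, not necessarily the exact scheme in the paper):

```python
import numpy as np

NB_GAUSSIAN, Z_SIZE = 3, 4
temperature = 1.0  # >1 flattens each categorical -> more stochastic dreams

# One categorical distribution per dimension of z
# (stored here as (Z_SIZE, NB_GAUSSIAN); the transposed layout is equivalent).
logits = np.random.randn(Z_SIZE, NB_GAUSSIAN)
pi = np.exp(logits / temperature)
pi /= pi.sum(axis=1, keepdims=True)

mu    = np.random.randn(Z_SIZE, NB_GAUSSIAN)
sigma = np.ones((Z_SIZE, NB_GAUSSIAN))

# For each dimension j, pick which of the NB_GAUSSIAN Gaussians to use.
choice = np.array([np.random.choice(NB_GAUSSIAN, p=pi[j]) for j in range(Z_SIZE)])

new_mu    = mu[np.arange(Z_SIZE), choice]     # mixed mean,   shape (Z_SIZE,)
new_sigma = sigma[np.arange(Z_SIZE), choice]  # mixed stddev, shape (Z_SIZE,)
z_next    = new_mu + new_sigma * np.random.randn(Z_SIZE)
```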

What are your thoughts on that? It seems strange to me that there is no single fixed answer to this.

hardmaru commented 6 years ago

Hi @Caselles

Thanks for the detailed explanation. I have reread your original message and your second, clearer explanation, and I understand your question now. Apologies, and please ignore my previous response, as it does not properly address your question!

I have followed the approach in (2), and modelled each individual dimension of Z (say, of 64 dims) as a mixture of 5 Gaussians in the MDN-RNN. I think this is a reasonable choice if our modelling assumption is that each dimension of Z is independent (given the RNN's hidden state).

Please refer to the code for DoomRNN, particularly get_lossfunc, to see that in the loss calculation we optimize the likelihood of y (the pre-processed data given by the VAE):

https://github.com/hardmaru/WorldModelsExperiments/blob/master/doomrnn/doomrnn.py#L352
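In rough NumPy terms, that per-dimension mixture negative log-likelihood looks like this (a sketch of the idea only, not the actual TensorFlow code in doomrnn.py; it assumes the network outputs logit_pi, mu, and logstd arrays of shape (Z_SIZE, NB_GAUSSIAN) for one timestep, and a target y of shape (Z_SIZE,)):

```python
import numpy as np
from scipy.special import logsumexp

def mdn_nll(logit_pi, mu, logstd, y):
    """Negative log-likelihood of y, with each dimension modelled as
    its own mixture of NB_GAUSSIAN univariate Gaussians (approach (2)).
    logit_pi, mu, logstd: shape (Z_SIZE, NB_GAUSSIAN); y: shape (Z_SIZE,)."""
    # Normalize the mixture logits over the Gaussian axis.
    log_pi = logit_pi - logsumexp(logit_pi, axis=1, keepdims=True)
    # log N(y_j | mu_jl, exp(logstd_jl)) for every (dimension, Gaussian) pair.
    log_gauss = (-0.5 * np.log(2.0 * np.pi) - logstd
                 - 0.5 * ((y[:, None] - mu) / np.exp(logstd)) ** 2)
    # Mixture log-likelihood of each dimension, then average over dimensions.
    return -np.mean(logsumexp(log_pi + log_gauss, axis=1))
```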

I would love to hear your thoughts as to whether (1) is preferable to (2). In my view, (2) might be more expressive. If you want to experiment with (1), the training of the MDN-RNN and the loss function will need to be modified to account for that.

Thanks!

Caselles commented 6 years ago

Thank you for your clear answer!

I have no prior preference between methods (1) and (2). I was just wondering about it because, in the original paper by Bishop on Mixture Density Networks (PDF link), he only considers the case where the predicted output is a scalar. Here the output is a vector, so we are left with a choice: model each component as its own mixture of k Gaussians (method (2)), or model the whole vector with k multivariate Gaussians (method (1)).

Indeed, method (2) seems intuitively more expressive. I wonder whether method (1) works; it could be that (1) works just as well and is simpler.

I will be experimenting with (1). Thanks for pointing out that the training and the loss would need to be modified. I'll let you know if I get any interesting results.

Thanks again for your work, and especially for answering questions here and on Reddit and for publishing the code. The effort is much appreciated!