xbpeng / DeepMimic

Motion imitation with deep reinforcement learning.
https://xbpeng.github.io/projects/DeepMimic/index.html
MIT License

Questions about the Multiplicative Compositional Policies (MCP) #64

Open ysluo opened 5 years ago

ysluo commented 5 years ago

I tried to implement your recently published paper "MCP: Learning Composable Hierarchical Control with Multiplicative Compositional Policies" according to my understanding, but I can't get the same result as the one shown in the supplementary video. During the pre-training task, the learned agent just keeps standing while I constantly switch between different motions, and the weights output by the gating network are mostly 1 for all primitives.

I have a few questions:

  1. Is the imitation objective in the pre-training task computed based on the currently selected motion only (like the way the skill selector is trained in DeepMimic)? And is there any goal objective (target states of the next two timesteps) involved in the reward function?

  2. Does the covariance matrix \Sigma have a dedicated normalizer, the way the state, goal, and action mean each have their own normalizer in DeepMimic?

  3. Comparing the dimensions of the state features (Table 1 in the DeepMimic paper and Table 3 in the MCP paper), I found that the T-Rex and humanoid characters both have one less dimension in MCP. Is the phase variable \phi removed from the state features in MCP?

  4. Is it possible to learn skills with different velocities (standing, walking, and running) using MCP?

Thanks for your patience in reading my questions.

xbpeng commented 5 years ago

One of the differences between MCP and DeepMimic is that in MCP the target states from the reference motion are provided as part of the input to the policy, whereas in DeepMimic the policy doesn't get direct access to the reference motion. Also, when you provide the target state as input, do both the policy and the gating function get the target state as input? If both of them do, then that is probably why your gating function is becoming degenerate. In MCP only the gating function gets the target state as input, not the primitives.

1) Yes, the reward is calculated using the current motion. What do you mean by a goal objective for the target states in the reward function? The reward is calculated using the character's current state and the next state from the target reference motion, so the reward does involve the target state.

3) Yes, the phase variable is not part of the state in MCP, since the target state input already provides phase information. There are some slight differences in the action parameterization, so the dimensions are smaller in MCP, but that shouldn't be very important.

4) Yes, it should be possible to learn these different skills with MCP. Just feed the policy different reference motions as input.
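
For reference, a minimal sketch of that input split (toy numpy stand-ins, not the actual MCP networks; all layer shapes, sizes, and the softplus output are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions standing in for the real character/task setup.
STATE_DIM, GOAL_DIM, ACTION_DIM, NUM_PRIMITIVES = 10, 6, 3, 4

W_gate = rng.standard_normal((NUM_PRIMITIVES, STATE_DIM + GOAL_DIM)) * 0.1
W_mu   = rng.standard_normal((NUM_PRIMITIVES, ACTION_DIM, STATE_DIM)) * 0.1

def gating(state, goal):
    # w(s, g) >= 0: ONLY the gating function sees the target states g.
    return np.log1p(np.exp(W_gate @ np.concatenate([state, goal])))  # softplus

def primitives(state):
    # pi_i(a | s): the primitives see the proprioceptive state only, never the goal.
    mus = W_mu @ state                       # (NUM_PRIMITIVES, ACTION_DIM)
    sigmas = np.full_like(mus, 0.1)          # fixed std for the sketch
    return mus, sigmas

state, goal = rng.standard_normal(STATE_DIM), rng.standard_normal(GOAL_DIM)
w = gating(state, goal)                      # the goal is only used here
mus, sigmas = primitives(state)              # the primitives never see the goal
```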

ysluo commented 5 years ago

Thank you for the reply and clarification.

In my implementation the target states are only provided to the gating network. However, I did solve the problem by normalizing the primitive weights output by the gating network. After plotting the Gaussian primitives in 2D the way you did in the supplementary video, I found that the composite distribution is correct only if the sum of all primitive weights equals 1. I'm not sure if you did something similar in your implementation.
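
A minimal sketch of that kind of weight normalization (the exact form used here is an assumption):

```python
import numpy as np

w = np.array([0.7, 2.1, 0.4, 1.3])   # raw gating outputs (illustrative values)
w = w / np.sum(w)                    # rescale so the primitive weights sum to 1
assert np.isclose(w.sum(), 1.0)
```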

xbpeng commented 5 years ago

Good to hear that you found a solution. Can you clarify what you mean by the composite distribution being correct only when the weights sum to 1? We didn't have to explicitly normalize the weights from the gating function, since when composing the Gaussians, the weights in Equation 3 get normalized by the denominator.
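
For illustration, a minimal numpy sketch of composing Gaussian primitives in the spirit of Equation 3 (this is the standard product-of-powered-Gaussians identity with diagonal covariances, not code from the paper, and the example values are made up):

```python
import numpy as np

def compose_gaussians(w, mus, sigmas):
    """Multiplicatively compose Gaussian primitives (diagonal covariance).

    w:      (k,)   non-negative weights from the gating function
    mus:    (k, d) primitive means
    sigmas: (k, d) primitive standard deviations
    """
    prec = w[:, None] / sigmas**2            # w_i / sigma_i^2
    denom = prec.sum(axis=0)                 # the normalizing denominator
    mu = (prec * mus).sum(axis=0) / denom    # precision-weighted mean
    sigma = np.sqrt(1.0 / denom)             # composite standard deviation
    return mu, sigma

# Three primitives over a 2-D action, all with std 0.5 but different means.
mus = np.array([[-1.0, 0.0], [0.5, 1.0], [2.0, -0.5]])
sigmas = np.full((3, 2), 0.5)

mu1, s1 = compose_gaussians(np.array([0.2, 0.3, 0.5]), mus, sigmas)  # weights sum to 1
mu2, s2 = compose_gaussians(np.array([0.4, 0.6, 1.0]), mus, sigmas)  # same ratios, sum to 2

print(s1)                      # [0.5 0.5]      -> matches the primitives' std
print(s2)                      # ~[0.354 0.354] -> tighter composite
print(np.allclose(mu1, mu2))   # True: the denominator cancels any common scaling of w
```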

ysluo commented 5 years ago

I followed Equation 3 to visualize the Gaussian primitives in 2D, and the image below is what I got.

[Image: the Gaussian primitives and the resulting composite distribution, plotted for two cases.]

Figure 1 shows that when the weights sum to 1, the composite distribution has the same spread (0.025) as the primitives. But in Figure 2, where the weights don't sum to 1, the composite distribution looks different from the primitive distributions. I'm not sure which one is correct, but Figure 1 makes more sense to me, which is why I came up with the idea of normalizing the weights.

xbpeng commented 5 years ago

I see. Note that increasing w has an effect similar to decreasing the variance of a primitive's distribution, but both the normalized and unnormalized w's should still produce a valid distribution. The normalized version may be a bit more intuitive, but keeping the w's unnormalized gives the gating function a little more flexibility to adjust the variance of the composite distribution. I think both should be fine.
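
To make that concrete, a small numeric check using the same composition rule as the sketch above (illustrative values only):

```python
import numpy as np

# Two 1-D Gaussian primitives with equal std.
mus    = np.array([0.0, 1.0])
sigmas = np.array([0.5, 0.5])

def compose(w):
    prec = w / sigmas**2                     # w_i / sigma_i^2
    return (prec * mus).sum() / prec.sum(), np.sqrt(1.0 / prec.sum())

print(compose(np.array([0.5, 0.5])))   # mu = 0.5, sigma = 0.5  (weights sum to 1)
print(compose(np.array([2.0, 2.0])))   # mu = 0.5, sigma = 0.25 (same ratios, larger w)
# Scaling all the weights up leaves the composite mean unchanged but tightens the
# distribution, i.e. a larger w behaves like a smaller primitive variance.
```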