raffaello-camoriano opened this issue 5 years ago
You can get the center of mass by calling cSimCharacter::CalcCOMVel().
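A minimal sketch (not code from the repo) of how the returned COM velocity can be reduced to a planar heading; it assumes tVector is the library's Eigen-based 4D vector, y is the up axis, and the atan2 sign convention is a guess:

#include <cmath>

// Hypothetical helper: heading of the character's COM velocity in the XZ plane.
static double CalcComHeading(const cSimCharacter& sim_char)
{
    tVector com_vel = sim_char.CalcCOMVel();        // world-frame center-of-mass velocity
    com_vel.y() = 0.0;                              // drop the vertical component
    return std::atan2(-com_vel.z(), com_vel.x());   // heading about the vertical axis (sign convention assumed)
}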
To vary the target heading during training, we just apply some random gaussian noise to the heading direction every timestep. The noise is fairly small. But with something like 10% probability, at every timestep, we will also just sample a completely new heading direction between [-pi, pi]. This encourages the policy to also learn sharp turns.
@xbpeng I have tried the above approach, but the result is usually the agent walking in circles (as if it were averaging over the headings). What ratio of heading reward to pose reward did you use to obtain heading-following behaviour?
Also, 10% at every timestep seems very high; is this the 0.00166667 s simulation timestep? At 10% the kin model really jitters.
Yes you will need to tune the weight between the heading and pose reward. For our tasks we were using 0.7 for the weight of the imitation reward and 0.3 for the heading reward. To debug, it might be helpful to just fix the target heading to always point in one direction (e.g. the positive x direction) and see if the character learns to walk straight.
Oh, sorry, I forgot a small detail about the target heading. We are not updating the target heading every environment step. We update it once every 0.25 seconds by applying uniform noise between [-0.15, 0.15] rad to the current heading. There is also a 10% probability, every 0.25 seconds, of switching to a completely random heading. Sorry about the confusion. See if these new settings help at all.
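A minimal sketch of the target-heading update described above, assuming it is invoked once every 0.25 s of simulation time (the function name and RNG plumbing are illustrative, not from the repo):

#include <cmath>
#include <random>

// Called every 0.25 s: perturb the current target heading slightly and, with 10%
// probability, jump to a completely new heading in [-pi, pi].
static double UpdateTargetHeading(double heading, std::mt19937& rng)
{
    std::uniform_real_distribution<double> noise(-0.15, 0.15);
    std::uniform_real_distribution<double> resample(-M_PI, M_PI);
    std::uniform_real_distribution<double> coin(0.0, 1.0);

    heading += noise(rng);            // small uniform perturbation
    if (coin(rng) < 0.1)
    {
        heading = resample(rng);      // occasional sharp turn
    }

    // wrap back into [-pi, pi]
    heading = std::fmod(heading + M_PI, 2.0 * M_PI);
    if (heading < 0.0)
    {
        heading += 2.0 * M_PI;
    }
    return heading - M_PI;
}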
Dear @xbpeng,
I was wondering whether the 5 walking, running and jogging clips with different turning radii, which you mention in Section 10.2 of the paper and show in the main video at 5:32, are downloadable somewhere or could kindly be provided. I would like to investigate the performance gap between single-clip and multi-clip steerable walking controllers.
Sure, you can find the motions here: https://drive.google.com/drive/folders/1TRINX1efQQV4KkWeOQjhYGNoz8PHKqgT?usp=sharing
Thank you so much!
@xbpeng @raffaello-camoriano @ZhengyiLuo Do I have to add heading_reward here?
@xbpeng Should I set root_w = 0 after adding the heading-based task_reward?
There seems to be a conflict between task_reward and the pure-imitation root_w * root_reward, since:
- root_reward includes the root position and orientation mismatch w.r.t. the straight-walking MoCap (which I could not find in Sec. 5.3), and therefore penalizes non-straight walking;
- task_reward, on the other hand, rewards walking along an arbitrary heading.
I also modified the code so that end_eff_reward is expressed relative to the Sim or Kin humanoid's root instead of the world frame, but this is not enough to fix the issue.
The best behaviour obtained with these settings is the humanoid walking in circles.
Do you have any suggestions?
Thank you.
@xbpeng @raffaello-camoriano @ZhengyiLuo Do I have to add heading_reward here?
Yes, you can add an additional reward term for the heading there.
Yes, it might be a good idea to disable the root reward if you are going to add the heading term. You can also play around with the weights for the different objectives a bit to get the desired behaviours. I'm not sure why the character would be walking in circles. But sounds like it could be an issue with the heading reward?
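A rough sketch of what disabling the root term might look like, assuming the sub-reward weights are the local constants defined at the top of cSceneImitate::CalcRewardImitate (the default values shown are from memory, so double-check them against the source; the renormalization keeps the imitation reward in [0, 1]):

// inside cSceneImitate::CalcRewardImitate(...)
double pose_w = 0.5;
double vel_w = 0.05;
double end_eff_w = 0.15;
double root_w = 0.0;   // was 0.2; disable the root position/orientation term for the heading task
double com_w = 0.1;

// renormalize so the remaining terms still sum to 1
double total_w = pose_w + vel_w + end_eff_w + root_w + com_w;
pose_w /= total_w;
vel_w /= total_w;
end_eff_w /= total_w;
root_w /= total_w;
com_w /= total_w;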
@xbpeng, first of all, thanks for sharing your outstanding work. I'm trying to implement the heading too. Thanks for the extra details about how and when you change the heading (and also for the extra motion files of the human turning left and right). I have added the reward here:
double cSceneImitate::CalcReward(int agent_id) const
{
    const cSimCharacter* sim_char = GetAgentChar(agent_id);
    bool fallen = HasFallen(*sim_char);
    double r = 0;
    if (!fallen)
    {
        if (haveGoalDir())
        {
            // blend the imitation and heading objectives with the 0.7/0.3 weights
            r = CalcRewardImitate(*sim_char, *mKinChar) * 0.7 + CalcRewardGoal(*sim_char, *mKinChar) * 0.3;
        }
        else
        {
            r = CalcRewardImitate(*sim_char, *mKinChar);
        }
    }
    return r;
}
double cSceneImitate::CalcRewardGoal(const cSimCharacter& sim_char, const cKinCharacter& kin_char) const
{
    double reward = 0.0;
    auto ct_ctrl = dynamic_cast<cCtController*>(this->GetController().get());
    if (ct_ctrl != nullptr)
    {
        const auto goalDir = ct_ctrl->GetGoalDir();
        // Get the Center-Of-Mass velocity along the XZ plane only...
        tVector com_vel0_world = sim_char.CalcCOMVel();
        com_vel0_world.y() = 0.0;
        com_vel0_world.normalize();
        // penalize the mismatch between the COM velocity direction and the goal direction
        const auto angle = std::max(0.0, 1.0 - com_vel0_world.dot(goalDir)); // std::max instead of the MSVC-only __max
        reward = exp(-2.5 * angle * angle);
    }
    return reward;
}
Does this look correct to you?
P.S.: I must definitely try the suggestion to set root_w to 0 (or close to zero).
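For comparison, the Target Heading reward described in the paper also involves a target speed v^* and penalizes only the shortfall of the COM velocity projected onto the target direction, roughly r^G = exp(-2.5 * max(0, v^* - d^* . v)^2), so moving faster than the target speed is not penalized. A rough sketch of that variant, reusing goalDir and CalcCOMVel from the snippet above (the helper name and the target_speed parameter are illustrative, not from the repo):

#include <algorithm>
#include <cmath>

// Paper-style heading reward: penalize only when the COM velocity projected onto
// the goal direction falls short of a target speed (e.g. ~1 m/s).
static double CalcHeadingReward(const cSimCharacter& sim_char, const tVector& goal_dir,
                                double target_speed)
{
    tVector com_vel = sim_char.CalcCOMVel();
    com_vel.y() = 0.0;                              // restrict to the XZ plane
    double proj_speed = com_vel.dot(goal_dir);      // speed along the target heading
    double vel_err = std::max(0.0, target_speed - proj_speed);
    return std::exp(-2.5 * vel_err * vel_err);
}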
Hi and thanks again for your amazing work.
Since the provided code covers only imitation, I am trying to implement the humanoid walking policy with directional input myself.
To do so, I am following the details in the paper, and in particular Sec. 9 - Target Heading.
I have a couple of questions on implementation details to begin with:
1. Are the target headings d_t^* and speeds v^* randomly generated during policy training? Are they varied in order to produce "informative" bends, changes in pace and otherwise rich walking behaviours? If this is the case, could you please explain it at an implementable level of detail?
2. How do you compute v_t, the center-of-mass velocity of the simulated character?
Thanks a lot.