xbpeng / DeepMimic

Motion imitation with deep reinforcement learning.
https://xbpeng.github.io/projects/DeepMimic/index.html
MIT License

My Multi-Clip Reward implementation results. #120

Open AGPX opened 4 years ago

AGPX commented 4 years ago

Hi,

I have tried to implement the multi-clip reward technique, but so far I haven't achieved great results. My goal is to reproduce the steering behavior shown in the paper. I took the following steps:

1] Added loading of multiple MoCap clips (walk_forward, turn_left0, turn_left1, turn_left2, turn_right0, turn_right1, turn_right2). For convenience, I modified them to have the same number of frames and the same frame rate (30 fps), and therefore the same duration (a sketch of how I load the clips follows at the end of this step). I think this is important, because we have a single controller and otherwise I wouldn't know which of the 7 clips to use to set the controller cycle period (see the SceneImitate::BuildController method). All kinematic actors are perfectly synchronized, as you can see in this video:

https://youtu.be/MXAnrLQL2fo

As usual, when a clip reaches its end, the kinematic actor is repositioned and reoriented according to the position and orientation of the simulated actor (in green, not shown in this video). One question: all motion clips have the "wrap" flag set to "loop". Is this correct? Or should only "walk_forward" be set to "loop"? Can this affect the result?
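
In short, I build one cKinCharacter per clip, along these lines (a simplified sketch modeled on the stock cSceneImitate::BuildKinCharacter; mMotionFiles, the list of clip paths, is my own member, and the exact cKinCharacter::tParams fields may differ from your checkout):

bool cSceneImitate::BuildKinCharacters()
{
    bool succ = true;
    mKinChars.clear();
    for (size_t i = 0; i < mMotionFiles.size(); ++i)
    {
        auto kin_char = std::make_shared<cKinCharacter>();
        cKinCharacter::tParams params;
        params.mID = static_cast<int>(i);
        params.mCharFile = mCharParams[0].mCharFile; // same skeleton for every clip
        params.mMotionFile = mMotionFiles[i];
        succ &= kin_char->Init(params);
        mKinChars.push_back(kin_char);
    }
    return succ;
}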

2] Added the heading goal to the network as 2 floats representing a unit vector (dx, 0, dz): it encodes only the direction.
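
Concretely, something like this (a sketch; the controller subclass name and GetGoalDir are my own additions, while RecordGoal/GetGoalSize are the standard controller hooks the observation is read from):

int cCtHeadingController::GetGoalSize() const
{
    return 2; // (dx, dz) components of the unit heading direction
}

void cCtHeadingController::RecordGoal(Eigen::VectorXd& out_goal) const
{
    const tVector dir = GetGoalDir(); // unit vector (dx, 0, dz)
    out_goal.resize(GetGoalSize());
    out_goal[0] = dir[0];
    out_goal[1] = dir[2];
}

One thing I'm unsure about: the state features are canonicalized to the character's local heading frame, so it may be better to rotate the goal direction into that same frame before recording it, instead of feeding it in world coordinates.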

3] Set the sync_char_root_rot flag to true (in addition to sync_char_root_pos). I think this is necessary, because otherwise, once the actor is aligned with the direction vector, it will not select the walk_forward clip. Am I right?
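
For reference, in the scene arg file that's just:

--sync_char_root_pos true
--sync_char_root_rot true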

4] Added two functions to SceneImitate.cpp:

double cSceneImitate::CalcKinHeadingRewardGoal(const cSimCharacter& sim_char, const cKinCharacter& kin_char) const
{
    double reward = 0.0;
    auto ct_ctrl = dynamic_cast<cCtController*>(this->GetController().get());
    if (ct_ctrl != nullptr)
    {
        const auto goalDir = ct_ctrl->GetGoalDir();

        const auto& joint_mat = sim_char.GetJointMat();
        const auto& body_defs = sim_char.GetBodyDefs();

        const Eigen::VectorXd& pose1 = kin_char.GetPose();
        const Eigen::VectorXd& vel1 = kin_char.GetVel();

        // Center-of-mass position and velocity of the kinematic character.
        tVector com1_world;
        tVector com_vel1_world;
        cRBDUtil::CalcCoM(joint_mat, body_defs, pose1, vel1, com1_world, com_vel1_world);

        // Project the COM velocity onto the XZ (ground) plane and normalize,
        // so the dot product below is the cosine of the heading error.
        com_vel1_world.y() = 0.0;
        com_vel1_world.normalize();

        // err = 1 - cos(theta), in [0, 2]. std::max (from <algorithm>)
        // replaces the MSVC-only __max macro.
        const auto err = std::max(0.0, 1.0 - com_vel1_world.dot(goalDir));
        reward = exp(-2.5 * err * err);
    }

    return reward;
}

and

double cSceneImitate::CalcSimHeadingRewardGoal(const cSimCharacter& sim_char) const
{
    double reward = 0.0;
    auto ct_ctrl = dynamic_cast<cCtController*>(this->GetController().get());
    if (ct_ctrl != nullptr)
    {
        const auto goalDir = ct_ctrl->GetGoalDir();

        // COM velocity of the simulated character, projected onto the XZ
        // (ground) plane and normalized.
        tVector com_vel0_world = sim_char.CalcCOMVel();
        com_vel0_world.y() = 0.0;
        com_vel0_world.normalize();

        const auto err = std::max(0.0, 1.0 - com_vel0_world.dot(goalDir));
        reward = exp(-2.5 * err * err);
    }
    return reward;
}

The first calculates the heading reward for the kinematic actor; the second does the same for the simulated one.
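
For reference, as a function of the angle θ between the planar COM velocity and the goal direction, this reward is exp(-2.5·(1 - cos θ)²): 1.0 at 0°, ≈0.81 at 45°, ≈0.08 at 90°, and ≈4.5e-5 at 180°. It is very flat near alignment and only becomes sharply discriminative past roughly 90 degrees.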

5] Implemented (in CalcReward) 3 methods to calculate the multi-clip reward:

Method 1:


double r = 0.0, bestReward = 0.0;
std::shared_ptr<cKinCharacter> best_kin = nullptr;
for (auto& kin_char : mKinChars)
{
    // Score each clip by how well its COM velocity matches the goal heading.
    const double r2 = CalcKinHeadingRewardGoal(*sim_char, *kin_char);
    if (r2 > bestReward)
    {
        bestReward = r2;
        best_kin = kin_char;
    }
}
if (best_kin == nullptr) // only possible if the controller cast fails
{
    return 0.0;
}
r = CalcRewardImitate(*sim_char, *best_kin) * 0.70 + CalcSimHeadingRewardGoal(*sim_char) * 0.30;
return r;

This method first selects the best kinematic actor based on its heading reward (i.e. the actor most aligned with the heading direction vector). After choosing one, the final reward is the weighted sum of the imitation reward (70%) and the simulated actor's heading reward (30%).

Method 2:

double r = 0.0;
for (auto& kin_char : mKinChars)
{
    // Per-clip score: imitation reward plus that clip's own heading reward.
    const double r2 = CalcRewardImitate(*sim_char, *kin_char) * 0.70 + CalcKinHeadingRewardGoal(*sim_char, *kin_char) * 0.30;
    r = std::max(r, r2);
}
return r;

The second method ignores the simulated actor's heading reward entirely: it computes, for each kinematic actor, a combined imitation and heading reward, then takes the maximum.

Method 3:

double r = 0.0;
for (auto& kin_char : mKinChars)
{
    const double r2 = CalcRewardImitate(*sim_char, *kin_char) * 0.70 +
                      CalcKinHeadingRewardGoal(*sim_char, *kin_char) * 0.15 +
                      CalcSimHeadingRewardGoal(*sim_char) * 0.15;
    r = std::max(r, r2);
}
return r;

The last method is similar to the second, but also adds the simulated actor's heading reward.
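
For completeness, my reading of the paper's multi-clip reward is that the max is taken over the imitation terms only, with the goal term always computed from the simulated character, i.e. r = 0.70 · max_j r_imitate(sim, kin_j) + 0.30 · r_goal(sim). As a sketch (untried so far):

double r_imitate = 0.0;
for (auto& kin_char : mKinChars)
{
    // Max over per-clip imitation rewards; goal reward added once, outside the max.
    r_imitate = std::max(r_imitate, CalcRewardImitate(*sim_char, *kin_char));
}
return r_imitate * 0.70 + CalcSimHeadingRewardGoal(*sim_char) * 0.30;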

OK, just to simplify things, I fixed the heading direction during training to 45 degrees. The 1st and 3rd methods give a result like the one shown in the following video:

https://youtu.be/5XuMRXfK1os

As you can see, the actor deviates only slightly, but above all it doesn't walk correctly: it advances with consecutive mini jumps! I tried to tune the weighting factors, but the mini jumps don't disappear.

The 2nd method is more promising:

https://youtu.be/pJe0ab6uu3M

As you can see, this time the actor (almost) follows the target direction of 45 degrees, but it limps a bit (trained for 50 million samples).

So, none of these methods is completely satisfactory (and I haven't yet tested them with a heading vector that varies during training). What am I missing, or what have I misunderstood? Can you give me some direction on how to properly model the reward function? Should I consider the heading reward of both the kinematic and simulated actors, or only one? And with multiple clips, does the actor need to be trained for far more samples? Lastly, could the consecutive mini jumps exhibited by the 1st and 3rd methods be due to a local maximum?

The questions are many, I know. Thanks for any reply.

tfederico commented 4 years ago

Hi, I am also trying to reproduce the multi-clip implementation.

Could you share how you loaded multiple clips and created several kinematic characters?