xbpeng / DeepMimic

Motion imitation with deep reinforcement learning.
https://xbpeng.github.io/projects/DeepMimic/index.html
MIT License

Unable to train a provided example (bad results after 55M samples). #159

Open AGPX opened 3 years ago

AGPX commented 3 years ago

Hello @xbpeng,

Running the original, unmodified code, I get the following result after 55 million samples with 'train_amp_strike_humanoid3d_walk_punch_args.txt':

https://youtu.be/LZmtoZxfgV4

As you can see, the actor is still unable to walk properly. Is this normal? How many iterations are needed to reach the quality of the pre-trained result?

Note that I stopped training (CTRL+C) and restarted it after adding the line '--model_files output/agent0_model.ckpt' in order to reload the model trained so far, instead of starting from scratch. Could this be the cause? Is stopping and resuming training at a later time supported?
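Concretely, the only change I made before restarting was the line below, added to the arg file, after which I relaunched with my usual training command (paths and worker count just reflect my setup):

# line added to args/train_amp_strike_humanoid3d_walk_punch_args.txt
--model_files output/agent0_model.ckpt

# then relaunch training with the same arg file as before
python mpi_run.py --arg_file args/train_amp_strike_humanoid3d_walk_punch_args.txt --num_workers 16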

However, I have also noticed that, although it clearly works better, even the pre-trained network shows some uncertainty in its walking in this example. Is this again a matter of the number of samples, or is there something that could be improved in the AMP strike scene or its parameters? For instance, the probability that the actor spawns far from the target (tar_far_prob) may be too low, so the actor does not get enough practice at walking.

Thanks in advance,

G.

xbpeng commented 3 years ago

55 million samples is not a lot, especially for a challenging task like this one. You will need to train this policy with about 300-500 million samples.

The code doesn't currently support resuming training. I think what you did, with reloading a previous checkpoint, should be ok. But I'm not sure if there are some internal states that may not be initialized properly.

AGPX commented 3 years ago

The code doesn't currently support resuming training. I think what you did, with reloading a previous checkpoint, should be ok. But I'm not sure if there are some internal states that may not be initialized properly.

Doing such a long training in a single, continuous run is quite prohibitive for me: the ability to restart from a checkpoint is crucial. I'm going to test whether reloading the checkpoint gives good results (please let me know if you think of any states that may not be initialized properly!). To make this easier, I have updated the code in rl_world.py.

At line 84, I changed:

if curr_model_file != 'none':
    curr_agent.load_model(curr_model_file)

to

if curr_model_file != 'none':
    # only load the checkpoint if it actually exists (it won't on the first run);
    # this requires 'import os' at the top of rl_world.py
    if os.path.isfile(curr_model_file + ".index"):
        curr_agent.load_model(curr_model_file)

in order to avoid errors when the model doesn't exist yet (i.e., on the first run).

Anyway, 300-500 million samples is a huge number. With numbers like these, if you are building a new custom scene, it takes a very long time to find out that, for example, your reward function is completely wrong! If I had built the AMP strike example myself, I would have concluded that there was some problem in my code, when it was really just a matter of samples! Do you have any suggestions on how to detect mistakes faster? Speeding up training would be very helpful. Could using algorithms other than PPO (such as TRPO), or other optimizers (Adam instead of SGD with momentum), reduce the number of samples required? Have you done any experiments?

xbpeng commented 3 years ago

The code can definitely still be optimized to be faster, but off the top of my head, I don't know of any easy changes that will lead to significant speedups. Using TRPO instead of PPO likely won't lead to much of a speed improvement. You could try using Adam, but I tend to find that SGD leads to better results, even if it might take more samples.

AGPX commented 3 years ago

You could try using Adam, but I tend to find that SGD leads to better results, even if it might take more samples.

Looks like this is not really true: https://medium.com/geekculture/a-2021-guide-to-improving-cnns-optimizers-adam-vs-sgd-495848ac6008

"The main finding of this paper is that by tuning all available hyperparameters at scales in deep learning, more general optimizers never underperform their special cases. In particular, they observe that RMSProp, Adam, and NAdam never underperformed SGD, NESTEROV, or Momentum."

The problem is that people generally use Adam with its default hyperparameters, which can lead to worse generalization performance; the real issue is that ALL of Adam's hyperparameters need to be tuned correctly as well. Using Adam in DeepMimic, I can successfully train many scenes with only 16M samples (instead of 60M or more).
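For what it's worth, here is a minimal, self-contained sketch (TF 1.x, the TensorFlow version DeepMimic targets) of the kind of optimizer swap I mean. The variable names and hyperparameter values below are purely illustrative, not DeepMimic's actual settings, and every one of Adam's knobs should be tuned per task rather than left at its default:

import tensorflow as tf

w = tf.Variable([1.0, -2.0], name="w")      # stand-in for a policy parameter
loss = tf.reduce_sum(tf.square(w))          # stand-in for the policy/critic loss

# roughly what the original setup corresponds to (SGD with momentum):
# opt = tf.train.MomentumOptimizer(learning_rate=1e-3, momentum=0.9)

# Adam with all of its hyperparameters exposed for tuning:
opt = tf.train.AdamOptimizer(learning_rate=1e-5,  # usually much smaller than the SGD step size
                             beta1=0.9,           # first-moment decay
                             beta2=0.999,         # second-moment decay
                             epsilon=1e-8)        # stability term, worth tuning too
train_op = opt.minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(10):
        sess.run(train_op)

In DeepMimic itself the change would go wherever the agent builds its solvers; the point is simply that the betas and epsilon matter as much as the learning rate.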

Link to the referred paper: https://arxiv.org/pdf/1910.05446.pdf

Abstract: "Selecting an optimizer is a central step in the contemporary deep learning pipeline. In this paper, we demonstrate the sensitivity of optimizer comparisons to the hyperparameter tuning protocol. Our findings suggest that the hyperparameter search space may be the single most important factor explaining the rankings obtained by recent empirical comparisons in the literature. In fact, we show that these results can be contradicted when hyperparameter search spaces are changed. As tuning effort grows without bound, more general optimizers should never underperform the ones they can approximate (i.e., Adam should never perform worse than momentum), but recent attempts to compare optimizers either assume these inclusion relationships are not practically relevant or restrict the hyperparameters in ways that break the inclusions. In our experiments, we find that inclusion relationships between optimizers matter in practice and always predict optimizer comparisons. In particular, we find that the popular adaptive gradient methods never underperform momentum or gradient descent. We also report practical tips around tuning often ignored hyperparameters of adaptive gradient methods and raise concerns about fairly benchmarking optimizers for neural network training."