The vision is essentially to have "tasks" that are defined by (1) their reward scales, (2) starting states/starting env, and (3) termination states. The environment then combines the tasks during training, presenting the robot with a random task (or something more sophisticated), so that a single policy learns to perform any of these tasks based on its command.
The main point of this sort of system would be to make adding new tasks easy, and to have reasonable defaults that make new tasks relatively likely to work. Generally, the goal is to roughly democratize "training a robot for a task". Another possible benefit is that a single policy that works on all these tasks might be faster to fine-tune for a new task. It's unclear how well this would work at these network sizes.
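To make the shape of this concrete, here is a minimal sketch of what a task definition and a multi-task environment could look like. Everything in it (the Task/MultiTaskEnv names, the reward terms, the toy dict state) is a hypothetical illustration of the idea, not an existing API, and the actual physics step is elided.

```python
# Sketch of the task abstraction: tasks = reward scales + start state + termination.
# All names and values here are placeholders, not an existing library interface.
from dataclasses import dataclass
from typing import Callable, Dict
import random

# Shared reward terms that every task can weight via its reward scales.
# A real setup would compute these from the simulator state.
REWARD_TERMS: Dict[str, Callable[[dict], float]] = {
    "base_height": lambda s: s["base_height"],
    "forward_vel": lambda s: s["forward_vel"],
    "energy": lambda s: -s["torque_sq"],
}


@dataclass
class Task:
    name: str
    reward_scales: Dict[str, float]                 # (1) per-term weights
    reset_fn: Callable[[], dict]                    # (2) starting state/env
    termination_fn: Callable[[dict, float], bool]   # (3) (state, t) -> done


class MultiTaskEnv:
    """Draws a random task each episode; the task name acts as the command
    the policy conditions on (e.g. appended to the observation)."""

    def __init__(self, tasks, dt=0.02):
        self.tasks, self.dt = tasks, dt

    def reset(self):
        self.task = random.choice(self.tasks)   # or a curriculum/scheduler
        self.t = 0.0
        return self.task.reset_fn(), self.task.name

    def step(self, state, action=None):
        # Simulator stepping is omitted; this only shows reward/termination.
        self.t += self.dt
        reward = sum(scale * REWARD_TERMS[term](state)
                     for term, scale in self.task.reward_scales.items())
        done = self.task.termination_fn(state, self.t)
        return reward, done
```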
Examples of tasks:
standing
walking
running
jumping (reward: height of the feet at the jump's apex*; termination: on landing? time-based?)
standing back up (reward: similar to standing but weighted more strongly toward base height. Gets pushed very strongly every few seconds. Start possibly on the ground, TBD. Termination is time-based only, no collision/height termination)
matching a dataset of human movements/dances (reward: position distances to the "ground truth"*; termination: when the dance ends)
kicking a ball into a goal, or playing soccer more generally (reward: ball getting closer to the goal*? Worried about the robot breaking apart from its own strength if it kicks the ball too hard, though; TBD)
fighting another robot (possibly a terrible idea, it just comes to mind. The reward would mainly be for staying standing and hitting the other robot on the head or torso, plus an adversarial reward for making the other robot fall down)
*among other things
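For illustration, two of the example tasks above expressed with the sketched interface. The scales, thresholds, and reset poses are placeholder guesses rather than tuned values, and the jumping task would additionally need a feet-apex-height term added to the shared reward terms.

```python
# Hypothetical task instances using the Task/MultiTaskEnv sketch above.
import random

stand_up = Task(
    name="stand_up",
    # Similar to standing but weighted more strongly toward base height.
    reward_scales={"base_height": 5.0, "energy": 0.1},
    # Start possibly on the ground: a randomized low/fallen pose (placeholder).
    reset_fn=lambda: {"base_height": random.uniform(0.05, 0.2),
                      "forward_vel": 0.0, "torque_sq": 0.0},
    # Time-based termination only, no collision/height termination.
    termination_fn=lambda state, t: t > 10.0,
)

jumping = Task(
    name="jumping",
    # A feet-apex-height reward term would be added here in a real setup.
    reward_scales={"base_height": 1.0, "energy": 0.05},
    reset_fn=lambda: {"base_height": 0.6, "forward_vel": 0.0, "torque_sq": 0.0},
    # Landing- or time-based termination, per the open question above.
    termination_fn=lambda state, t: t > 4.0,
)

env = MultiTaskEnv([stand_up, jumping])
state, command = env.reset()  # command tells the policy which task is active
```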