I disagree. The agent has to learn to evade obstacles; if it does, it gets a reward. That should be sufficient to learn from.
Sure, but I am not talking about the learning itself. I am talking about the progress reports we print during training:
```
|-- 91% (Avg. Rew. of 2265.0)
|-- 92% (Avg. Rew. of 3503.5)
|-- 93% (Avg. Rew. of 3541.6666666666665)
|-- 94% (Avg. Rew. of 2651.0)
|-- 95% (Avg. Rew. of 4307.666666666667)
|-- 96% (Avg. Rew. of 3405.0)
|-- 97% (Avg. Rew. of 3523.3333333333335)
|-- 98% (Avg. Rew. of 3034.8333333333335)
|-- 99% (Avg. Rew. of 4230.666666666667)
```
These numbers look fairly noisy, even though the policy is already quite good at this point. That is why I believe the average reward over the last training episodes, which still include exploration, is not very indicative here. The DQN loss may be more helpful, or some intermediate test episodes; see the sketch below.
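For example, intermediate test episodes could be run greedily, so the reported number reflects the policy rather than the exploration noise. This is only a minimal sketch: `agent.act(state, explore=False)` and the Gym-style `env.step` return signature are assumptions about the interface, not this repository's actual API.

```python
import numpy as np

def evaluate_policy(env, agent, n_episodes=5):
    """Run greedy test episodes (no exploration) and return the mean episode reward."""
    returns = []
    for _ in range(n_episodes):
        state, done, total = env.reset(), False, 0.0
        while not done:
            # Assumed interface: explore=False picks the greedy action (epsilon off).
            action = agent.act(state, explore=False)
            state, reward, done, _ = env.step(action)
            total += reward
        returns.append(total)
    return float(np.mean(returns))
```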
Oh, misunderstanding then. I agree. More info is better.
The printed progress info now shows the average repr loss, the policy loss, the latest rewards, and the time elapsed. What else do you think could be useful?
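For illustration, such a progress line could be assembled roughly like this (a sketch only; all argument names are placeholders, not the repository's actual variables):

```python
import time
import numpy as np

def print_progress(percent, repr_losses, policy_losses, recent_rewards, start_time):
    """Format one progress line from the quantities tracked during training."""
    elapsed = time.time() - start_time
    print(f"|-- {percent}% "
          f"(avg repr loss: {np.mean(repr_losses):.4f}, "
          f"avg policy loss: {np.mean(policy_losses):.4f}, "
          f"latest rewards: {list(recent_rewards)[-3:]}, "
          f"elapsed: {elapsed:.0f}s)")
```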
I would say that's fine now :+1:
Especially in the racing tasks, where one mistake means a loss, we need a better way to indicate learning progress than episode length, because the exploration is too destructive.
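One option, purely as a sketch under assumed names, is to track a moving average of greedy evaluation returns (e.g. from the `evaluate_policy` sketch above) as the progress signal instead of the exploration-corrupted episode length:

```python
from collections import deque

class EvalProgress:
    """Moving average of evaluation returns as a smoother progress indicator."""

    def __init__(self, window=10):
        # Window size is an arbitrary illustrative choice.
        self.returns = deque(maxlen=window)

    def update(self, eval_return):
        """Record one evaluation return and get the smoothed value back."""
        self.returns.append(eval_return)
        return sum(self.returns) / len(self.returns)
```

Feeding this with periodic test episodes would give a progress curve that is not wiped out by a single destructive exploratory action.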