zhejz / carla-roach

Roach: End-to-End Urban Driving by Imitating a Reinforcement Learning Coach. ICCV 2021.
https://zhejz.github.io/roach

Questions about exploration loss design #39

Closed: gujiamitu closed this issue 2 months ago

gujiamitu commented 3 months ago

Thanks for this wonderful work. As the experiments show, the exploration loss greatly improves PPO performance. What is the intuition behind it? And how are the different Beta distributions chosen for different events? For example, when the agent runs a red light or collides with other agents, why use Beta(1, 2.5) as p_z? Finally, is there any work on mathematical modeling that I could refer to for a better understanding?

zhejz commented 2 months ago

The intuition can be found on page 4 of the paper.

If z is related to a collision or running traffic light/sign, we apply pz = B(1, 2.5) on the acceleration to encourage Roach to slow down while the steering is unaffected. In contrast, if the car is blocked we use an acceleration prior B(2.5, 1). For route deviations, a uniform prior B(1, 1) is applied on the steering. Despite being equivalent to maximizing entropy in this case, the exploration loss further encourages exploration on steering angles during the last 10 seconds before the route deviation.
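To make the quoted passage concrete, here is a minimal sketch, not the repository's code. It assumes the exploration loss is a KL divergence from the event prior p_z to the policy's Beta output, KL(p_z || pi), which is my reading of the paper; the weighting factor and the restriction to the last steps before the event are left out. The function names and the two example policies are hypothetical, and the integral is approximated numerically so that only the standard library is needed.

```python
import math

def beta_log_pdf(x, a, b):
    """Log-density of Beta(a, b) at x in (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1.0) * math.log(x) + (b - 1.0) * math.log(1.0 - x)

def kl_beta(p, q, n=20000):
    """Approximate KL(p || q) for two Beta distributions with the midpoint rule.

    p and q are (alpha, beta) tuples. Midpoints never touch 0 or 1,
    so the logarithms stay finite.
    """
    total = 0.0
    for i in range(n):
        x = (i + 0.5) / n
        log_p = beta_log_pdf(x, *p)
        total += math.exp(log_p) * (log_p - beta_log_pdf(x, *q))
    return total / n

# Prior applied to acceleration after a collision or red-light infraction.
p_z = (1.0, 2.5)

# Two hypothetical policy outputs for the acceleration dimension.
braking_policy = (1.0, 2.5)    # already matches the prior
throttle_policy = (2.5, 1.0)   # mass concentrated on high acceleration

print("braking :", kl_beta(p_z, braking_policy))
print("throttle:", kl_beta(p_z, throttle_policy))
```

Under these assumptions the penalty vanishes for a policy that already matches the prior and grows for a policy that keeps accelerating, which is the sense in which the prior "encourages Roach to slow down".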

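There is no rigorous mathematical model behind it, and the choice of distributions is fairly arbitrary. Just make sure the "direction" is correct, i.e. if you want to encourage deceleration, the distribution should have a mean less than 0 (assuming the action is rescaled from the Beta distribution's [0, 1] support to [-1, 1], this corresponds to a mean below 0.5 on [0, 1]).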
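The direction rule can be checked with a few lines of arithmetic. The sketch below uses the mean of Beta(alpha, beta), which is alpha / (alpha + beta), and assumes the action is linearly rescaled from [0, 1] to [-1, 1]; the event names and helper functions are hypothetical labels for illustration.

```python
# (alpha, beta) pairs as quoted from the paper; the dictionary keys are hypothetical names.
PRIORS = {
    "collision_or_red_light": (1.0, 2.5),  # prior on acceleration
    "blocked": (2.5, 1.0),                 # prior on acceleration
    "route_deviation": (1.0, 1.0),         # prior on steering (uniform)
}

def beta_mean(alpha, beta):
    """Mean of Beta(alpha, beta) on its native [0, 1] support."""
    return alpha / (alpha + beta)

def to_action_range(x):
    """Assumed linear rescaling from [0, 1] to [-1, 1]."""
    return 2.0 * x - 1.0

for event, (a, b) in PRIORS.items():
    m = beta_mean(a, b)
    print(f"{event}: mean on [0, 1] = {m:.3f}, rescaled = {to_action_range(m):+.3f}")
```

With these numbers, B(1, 2.5) has a rescaled mean below 0 (deceleration), B(2.5, 1) has one above 0 (acceleration), and the uniform B(1, 1) is centered at 0, so it favors neither steering direction.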