About the training time of RL agent

zhejz / carla-roach

Roach: End-to-End Urban Driving by Imitating a Reinforcement Learning Coach. ICCV 2021.

https://zhejz.github.io/roach

Other

293 stars 49 forks source link

About the training time of RL agent #7

Closed penghao-wu closed 2 years ago

penghao-wu commented 2 years ago

Hi, thank you for sharing your excellent work! I want to train your RL agent with a 6 1080Ti (6 Carla server on 6 different GPU) 56 cores machine. But it seems that it takes around 5-6 min for 12288 steps so the total 10M steps will take around 50 days which is not acceptable. Do you know what the possible reason is or how to improve the speed? Thank you!

zhejz commented 2 years ago

10,000,000 / 12288 * 6 = 4883 min = 3.4 day

By the way, the RL training is not a GPU-demanding task, running single CARLA server on multiple GPUs is less efficient due to the synchronization overhead. For RL training we run 6 CARLA servers on 1 GPU, each epoch (12288 steps) takes roughly 10 min on the aws g4 instance equipped with single NVIDIA T4 GPU.

penghao-wu commented 2 years ago

Thank you very much. I misunderstood the "total_timesteps: 1e8" in the training config, the actual total steps needed are only 1e7 steps instead of 1e8 steps.