zhejz / carla-roach

Roach: End-to-End Urban Driving by Imitating a Reinforcement Learning Coach. ICCV 2021.
https://zhejz.github.io/roach

Train RL Experts #9

Closed Yiquan-lol closed 2 years ago

Yiquan-lol commented 2 years ago

Hello, when I execute `run/train_rl.sh`, several CARLA windows open at the same time and then the program crashes. I learned from your paper that the RL experts are trained in Town01-Town06 simultaneously, so I suspect my GPU may not meet the requirements of the code. Could you tell me the number and model of the GPUs you used when training the RL experts?

zhejz commented 2 years ago

The RL training was run on either one RTX 2080 Ti or one NVIDIA T4 (AWS g4dn instance). Training the RL expert with 6 CARLA servers requires more than 10 GB of GPU memory; if that's too much for your GPU, try training on fewer towns by removing some envs in this config file.
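For illustration only, the env list is a YAML file with one entry per CARLA server, so each entry you delete is one fewer server and roughly proportionally less GPU memory. The sketch below is hypothetical; take the exact key names and layout from the repo's actual train_envs config file, not from here:

```yaml
# Hypothetical sketch of a reduced train_envs list: two towns instead of six.
# Mirror the exact keys of the repo's own config file when editing.
train_envs:
  - env_id: Endless-v0
    carla_map: Town01
    port: 2000
  - env_id: Endless-v0
    carla_map: Town03
    port: 2002
```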

Yiquan-lol commented 2 years ago

Following your tips, I reduced the number of CARLA servers. However, the following errors were reported:

```
CarlaUE4-Linux: no process found
[2021-12-20 16:00:20,310][utils.server_utils][INFO] - Kill Carla Servers!
CarlaUE4-Linux: no process found
[2021-12-20 16:00:21,321][utils.server_utils][INFO] - Kill Carla Servers!
[2021-12-20 16:00:21,321][utils.server_utils][INFO] - CUDA_VISIBLE_DEVICES=0 bash /home/jjuv/carla/CARLA_0.9.10.1/CarlaUE4.sh -fps=10 -quality-level=Epic -carla-rpc-port=2000
4.24.3-0+++UE4+Release-4.24 518 0
Disabling core dumps.
[2021-12-20 16:00:26,357][__main__][INFO] - making port 2000
/home/jjuv/anaconda3/envs/roach/lib/python3.7/site-packages/gym/logger.py:30: UserWarning: WARN: Box bound precision lowered by casting to float32
  warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: 3
wandb: You chose 'Don't visualize my results'
wandb: Offline run mode, not syncing to the cloud.
wandb: W&B syncing is set to `offline` in this directory. Run `wandb online` to enable cloud syncing.
wandb: WARNING Symlinked 3 files into the W&B run directory, call wandb.save again to sync new files.
trainable parameters: 1.53M
./run/train_rl.sh: line 3:  6625 Segmentation fault      (core dumped) python -u train_rl.py agent.ppo.wb_run_path=null wb_project=train_rl_experts wb_name=roach agent/ppo/policy=xtma_beta agent.ppo.training.kwargs.explore_coef=0.05 carla_sh_path=${CARLA_ROOT}/CarlaUE4.sh
PYTHON_RETURN=1​39!!! Start Over!!!
/home/jjuv/carla/CARLA_0.9.10.1/CarlaUE4.sh: line 5:  6681 Killed                  "$UE4_PROJECT_ROOT/CarlaUE4/Binaries/Linux/CarlaUE4-Linux-Shipping" CarlaUE4 $@
[2021-12-20 16:08:00,560][utils.server_utils][INFO] - Kill Carla Servers!
CarlaUE4-Linux: no process found
[2021-12-20 16:08:01,570][utils.server_utils][INFO] - Kill Carla Servers!
[2021-12-20 16:08:01,571][utils.server_utils][INFO] - CUDA_VISIBLE_DEVICES=0 bash /home/jjuv/carla/CARLA_0.9.10.1/CarlaUE4.sh -fps=10 -quality-level=Epic -carla-rpc-port=2000
4.24.3-0+++UE4+Release-4.24 518 0
Disabling core dumps.
/home/jjuv/anaconda3/envs/roach/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 6 leaked semaphores to clean up at shutdown
  len(cache))
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: 3
wandb: You chose 'Don't visualize my results'
Traceback (most recent call last):
  File "/home/jjuv/anaconda3/envs/roach/lib/python3.7/site-packages/wandb/apis/normalize.py", line 24, in wrapper
    return func(*args, **kwargs)
  File "/home/jjuv/anaconda3/envs/roach/lib/python3.7/site-packages/wandb/apis/public.py", line 458, in run
    self._runs[path] = Run(self.client, entity, project, run)
  File "/home/jjuv/anaconda3/envs/roach/lib/python3.7/site-packages/wandb/apis/public.py", line 849, in __init__
    self.load(force=not attrs)
  File "/home/jjuv/anaconda3/envs/roach/lib/python3.7/site-packages/wandb/apis/public.py", line 949, in load
    raise ValueError("Could not find run %s" % self)
ValueError: Could not find run <Run train_rl_experts/gfvocthw/gfvocthw (not found)>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train_rl.py", line 40, in main
    agent = AgentClass('config_agent.yaml')
  File "/home/jjuv/carla-roach-main/agents/rl_birdview/rl_birdview_agent.py", line 15, in __init__
    self.setup(path_to_conf_file)
  File "/home/jjuv/carla-roach-main/agents/rl_birdview/rl_birdview_agent.py", line 23, in setup
    run = api.run(cfg.wb_run_path)
  File "/home/jjuv/anaconda3/envs/roach/lib/python3.7/site-packages/wandb/apis/normalize.py", line 62, in wrapper
    six.reraise(CommError, CommError(message, err), sys.exc_info()[2])
  File "/home/jjuv/anaconda3/envs/roach/lib/python3.7/site-packages/six.py", line 702, in reraise
    raise value.with_traceback(tb)
  File "/home/jjuv/anaconda3/envs/roach/lib/python3.7/site-packages/wandb/apis/normalize.py", line 24, in wrapper
    return func(*args, **kwargs)
  File "/home/jjuv/anaconda3/envs/roach/lib/python3.7/site-packages/wandb/apis/public.py", line 458, in run
    self._runs[path] = Run(self.client, entity, project, run)
  File "/home/jjuv/anaconda3/envs/roach/lib/python3.7/site-packages/wandb/apis/public.py", line 849, in __init__
    self.load(force=not attrs)
  File "/home/jjuv/anaconda3/envs/roach/lib/python3.7/site-packages/wandb/apis/public.py", line 949, in load
    raise ValueError("Could not find run %s" % self)
wandb.errors.error.CommError: Could not find run <Run train_rl_experts/gfvocthw/gfvocthw (not found)>

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
[2021-12-20 16:08:12,141][wandb.sdk.internal.internal][INFO] - Internal process exited
PYTHON_RETURN=1!!! Start Over!!!
/home/jjuv/carla/CARLA_0.9.10.1/CarlaUE4.sh: line 5: 12077 Killed                  "$UE4_PROJECT_ROOT/CarlaUE4/Binaries/Linux/CarlaUE4-Linux-Shipping" CarlaUE4 $@
[2021-12-20 16:08:16,992][utils.server_utils][INFO] - Kill Carla Servers!
CarlaUE4-Linux: no process found
[2021-12-20 16:08:18,005][utils.server_utils][INFO] - Kill Carla Servers!
[2021-12-20 16:08:18,006][utils.server_utils][INFO] - CUDA_VISIBLE_DEVICES=0 bash /home/jjuv/carla/CARLA_0.9.10.1/CarlaUE4.sh -fps=10 -quality-level=Epic -carla-rpc-port=2000
4.24.3-0+++UE4+Release-4.24 518 0
Disabling core dumps.
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice:
```

I'm not sure if this is related to wandb. What's going on?

zhejz commented 2 years ago

As suggested by the error log, you should set up wandb first. Create a wandb account and log in on your machine.
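For completeness, the login itself is a one-time shell step (the `wandb login --relogin` variant also appears in your later log):

```bash
# Authenticate this machine with W&B; you will be prompted for the API key
# from your wandb.ai account page. Credentials are cached locally.
wandb login
```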

Yiquan-lol commented 2 years ago

Does that mean wandb must be used? And do I have to log in to wandb even if I choose "(3) Don't visualize my results"?

zhejz commented 2 years ago

Yes, this repo supports only wandb. If you want a different logging backend, you will have to implement it yourself.

Yiquan-lol commented 2 years ago

As you said, I registered and logged in to wandb, but the following error occurred at runtime:

```
CarlaUE4-Linux: no process found
[2021-12-23 11:21:27,462][utils.server_utils][INFO] - Kill Carla Servers!
CarlaUE4-Linux: no process found
[2021-12-23 11:21:28,486][utils.server_utils][INFO] - Kill Carla Servers!
[2021-12-23 11:21:28,486][utils.server_utils][INFO] - CUDA_VISIBLE_DEVICES=0 bash /home/jjuv/carla/CARLA_0.9.10.1/CarlaUE4.sh -fps=10 -quality-level=Epic -carla-rpc-port=2000
4.24.3-0+++UE4+Release-4.24 518 0
Disabling core dumps.
[2021-12-23 11:21:33,555][__main__][INFO] - making port 2000
/home/jjuv/anaconda3/envs/roach/lib/python3.7/site-packages/gym/logger.py:30: UserWarning: WARN: Box bound precision lowered by casting to float32
  warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))
wandb: Currently logged in as: yqlol (use `wandb login --relogin` to force relogin)
wandb: Tracking run with wandb version 0.10.12
wandb: Syncing run roach
wandb: ⭐️ View project at https://wandb.ai/yqlol/train_rl_experts
wandb: 🚀 View run at https://wandb.ai/yqlol/train_rl_experts/runs/37rjz3na
wandb: Run data is saved locally in /home/jjuv/carla-roach-main/outputs/2021-12-23/11-21-26/wandb/run-20211223_112136-37rjz3na
wandb: Run `wandb offline` to turn off syncing.
wandb: WARNING Symlinked 3 files into the W&B run directory, call wandb.save again to sync new files.
trainable parameters: 1.53M
```

It gets stuck at 'trainable parameters: 1.53M'. What's going on?

Yiquan-lol commented 2 years ago

In addition, when I stopped the code and started again, the following error occurred:

```
/home/jjuv/carla/CARLA_0.9.10.1/CarlaUE4.sh: line 5: 18970 Killed                  "$UE4_PROJECT_ROOT/CarlaUE4/Binaries/Linux/CarlaUE4-Linux-Shipping" CarlaUE4 $@
[2021-12-23 11:36:20,038][utils.server_utils][INFO] - Kill Carla Servers!
CarlaUE4-Linux: no process found
[2021-12-23 11:36:21,054][utils.server_utils][INFO] - Kill Carla Servers!
[2021-12-23 11:36:21,055][utils.server_utils][INFO] - CUDA_VISIBLE_DEVICES=0 bash /home/jjuv/carla/CARLA_0.9.10.1/CarlaUE4.sh -fps=10 -quality-level=Epic -carla-rpc-port=2000
4.24.3-0+++UE4+Release-4.24 518 0
Disabling core dumps.
Traceback (most recent call last):
  File "train_rl.py", line 40, in main
    agent = AgentClass('config_agent.yaml')
  File "/home/jjuv/carla-roach-main/agents/rl_birdview/rl_birdview_agent.py", line 15, in __init__
    self.setup(path_to_conf_file)
  File "/home/jjuv/carla-roach-main/agents/rl_birdview/rl_birdview_agent.py", line 27, in setup
    f = max(all_ckpts, key=lambda x: int(x.name.split('_')[1].split('.')[0]))
ValueError: max() arg is an empty sequence

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
[2021-12-23 11:36:28,096][wandb.sdk.internal.internal][INFO] - Internal process exited
PYTHON_RETURN=1!!! Start Over!!!
```

zhejz commented 2 years ago

It's hard to tell why the training is stuck, but most likely it is related to the CARLA server. Try to debug and find the line at which the script gets stuck, and double-check the installation doc and follow the instructions closely.
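One quick way to rule the server in or out is a bare connectivity check. This is not part of the repo; it's a minimal sketch assuming the CARLA 0.9.10.1 Python egg is on your PYTHONPATH and a server is already running on the default port 2000:

```python
import carla

# Connect a bare client to the CARLA server that train_rl.py launches
# (default RPC port 2000, as seen in the logs above).
client = carla.Client('localhost', 2000)
client.set_timeout(10.0)  # seconds; increase on slow machines

# If either call below hangs or raises a timeout, the server itself is the
# problem (GPU memory, drivers, ports), not the RL training code.
print('server version:', client.get_server_version())
print('client version:', client.get_client_version())
```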

When you stop the code and start it again, train_rl.py will try to automatically resume from the latest checkpoint. Your error occurs because the training script failed before any checkpoint was saved. Simply removing outputs/checkpoint.txt will let the script train from scratch again.
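Concretely, from the directory you launch training from (assuming the default `outputs/` layout this repo writes):

```bash
# Delete the stale resume marker so train_rl.py starts a fresh run
# instead of looking for a checkpoint that was never written.
rm outputs/checkpoint.txt
```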

raozhongyu commented 2 years ago

> In addition, when I stopped the code and started again, the following error occurred: [...] `ValueError: max() arg is an empty sequence`

I'm hitting the same error. Have you solved it?

wangchangquan commented 1 year ago

> > In addition, when I stopped the code and started again, the following error occurred: [...] `ValueError: max() arg is an empty sequence`
>
> I'm hitting the same error. Have you solved it?

Same error here. Has anyone solved it?

zhejz commented 1 year ago

> It's hard to tell why the training is stuck, but most likely it is related to the CARLA server. Try to debug and find the line at which the script gets stuck, and double-check the installation doc and follow the instructions closely.
>
> When you stop the code and start it again, train_rl.py will try to automatically resume from the latest checkpoint. Your error occurs because the training script failed before any checkpoint was saved. Simply removing outputs/checkpoint.txt will let the script train from scratch again.

And here is the solution.

jameskobee commented 1 year ago

> As you said, I registered and logged in to wandb, but the following error occurred at runtime: [...] It gets stuck at 'trainable parameters: 1.53M'. What's going on?

I have the same problem. Can you tell me how to solve it? Thanks.

wangchangquan commented 1 year ago

I'm in China. I solved the problem by using a global VPN; I found that the program calls some Google APIs.