zhejz / carla-roach

Roach: End-to-End Urban Driving by Imitating a Reinforcement Learning Coach. ICCV 2021.
https://zhejz.github.io/roach

CUDA error: out of memory #15

Closed neilsambhu closed 2 years ago

neilsambhu commented 2 years ago

I have collected NoCrash-dense data successfully: https://github.com/neilsambhu/carla-roach/blob/NeilBranch0/run/data_collect_bc_NeilBranch0.sh https://github.com/neilsambhu/carla-roach/blob/NeilBranch0/data_collect_NeilBranch0.py

When I run my version of train_rl.py ( https://github.com/neilsambhu/carla-roach/blob/NeilBranch0/train_rl_NeilBranch0.py ), I get the following error:

Traceback (most recent call last):
  File "train_rl_NeilBranch0.py", line 87, in main
    agent = AgentClass('config_agent.yaml')
  File "/home/nsambhu/github/carla-roach/agents/rl_birdview/rl_birdview_agent.py", line 31, in __init__
    self.setup(path_to_conf_file)
  File "/home/nsambhu/github/carla-roach/agents/rl_birdview/rl_birdview_agent.py", line 205, in setup
    self._policy, self._train_cfg['kwargs'] = self._policy_class.load(self._ckpt)
  File "/home/nsambhu/github/carla-roach/agents/rl_birdview/models/ppo_policy.py", line 226, in load
    saved_variables = th.load(path, map_location=device)
  File "/home/nsambhu/anaconda3/envs/carla/lib/python3.7/site-packages/torch/serialization.py", line 529, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/home/nsambhu/anaconda3/envs/carla/lib/python3.7/site-packages/torch/serialization.py", line 702, in _legacy_load
    result = unpickler.load()
  File "/home/nsambhu/anaconda3/envs/carla/lib/python3.7/site-packages/torch/serialization.py", line 665, in persistent_load
    deserialized_objects[root_key] = restore_location(obj, location)
  File "/home/nsambhu/anaconda3/envs/carla/lib/python3.7/site-packages/torch/serialization.py", line 737, in restore_location
    return default_restore_location(storage, map_location)
  File "/home/nsambhu/anaconda3/envs/carla/lib/python3.7/site-packages/torch/serialization.py", line 156, in default_restore_location
    result = fn(storage, location)
  File "/home/nsambhu/anaconda3/envs/carla/lib/python3.7/site-packages/torch/serialization.py", line 136, in _cuda_deserialize
    return storage_type(obj.size())
  File "/home/nsambhu/anaconda3/envs/carla/lib/python3.7/site-packages/torch/cuda/__init__.py", line 480, in _lazy_new
    return super(_CudaBase, cls).__new__(cls, *args, **kwargs)
RuntimeError: CUDA error: out of memory

My shell script to call train_rl.py is listed here: https://github.com/neilsambhu/carla-roach/blob/NeilBranch0/run/train_rl_NeilBranch0.sh

I have already reduced the batch size from 256 to 1 and the error persists: https://github.com/neilsambhu/carla-roach/blob/NeilBranch0/config/agent/ppo/training/ppo.yaml

Output from ( https://github.com/neilsambhu/carla-roach/blob/NeilBranch0/train_rl_NeilBranch0.py#L78 ) to show the batch size decreased: cfg.agent[agent_name] {'entry_point': 'agents.rl_birdview.rl_birdview_agent:RlBirdviewAgent', 'wb_run_path': '', 'wb_ckpt_step': None, 'env_wrapper': {'entry_point': 'agents.rl_birdview.utils.rl_birdview_wrapper:RlBirdviewWrapper', 'kwargs': {'input_states': ['control', 'vel_xy'], 'acc_as_action': True}}, 'policy': {'entry_point': 'agents.rl_birdview.models.ppo_policy:PpoPolicy', 'kwargs': {'policy_head_arch': [256, 256], 'value_head_arch': [256, 256], 'features_extractor_entry_point': 'agents.rl_birdview.models.torch_layers:XtMaCNN', 'features_extractor_kwargs': {'states_neurons': [256, 256]}, 'distribution_entry_point': 'agents.rl_birdview.models.distributions:BetaDistribution', 'distribution_kwargs': {'dist_init': None}}}, 'training': {'entry_point': 'agents.rl_birdview.models.ppo:PPO', 'kwargs': {'learning_rate': 1e-05, 'n_steps_total': 12288, 'batch_size': 1, 'n_epochs': 20, 'gamma': 0.99, 'gae_lambda': 0.9, 'clip_range': 0.2, 'clip_range_vf': None, 'ent_coef': 0.01, 'explore_coef': 0.05, 'vf_coef': 0.5, 'max_grad_norm': 0.5, 'target_kl': 0.01, 'update_adv': False, 'lr_schedule_step': 8}}, 'obs_configs': {'birdview': {'module': 'birdview.chauffeurnet', 'width_in_pixels': 192, 'pixels_ev_to_bottom': 40, 'pixels_per_meter': 5.0, 'history_idx': [-16, -11, -6, -1], 'scale_bbox': True, 'scale_mask_col': 1.0}, 'speed': {'module': 'actor_state.speed'}, 'control': {'module': 'actor_state.control'}, 'velocity': {'module': 'actor_state.velocity'}}}
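
For context, this OOM is raised while th.load maps the checkpoint tensors onto the GPU inside PpoPolicy.load, before any training batch is assembled, so reducing batch_size cannot help at that point. Below is a minimal, generic PyTorch sketch (illustrative only, not the carla-roach code) of the usual workaround: deserialize the checkpoint on the CPU and move the weights to the GPU afterwards.

    # Generic sketch: CPU-side deserialization avoids allocating GPU memory inside th.load.
    import torch as th
    import torch.nn as nn

    model = nn.Linear(10, 2)  # stand-in for the PPO policy network
    th.save({'state_dict': model.state_dict()}, '/tmp/demo_ckpt.pth')  # placeholder checkpoint

    saved = th.load('/tmp/demo_ckpt.pth', map_location='cpu')  # no GPU memory touched here
    model.load_state_dict(saved['state_dict'])

    device = 'cuda' if th.cuda.is_available() else 'cpu'
    model.to(device)  # the GPU transfer happens only at this point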

neilsambhu commented 2 years ago

As a workaround, I configured endless_all.yaml ( https://github.com/neilsambhu/carla-roach/blob/NeilBranch0/config/train_envs/endless_all.yaml ) to train on Town01 only. I get the following error when running one loop of train_rl_NeilBranch0.sh ( https://github.com/neilsambhu/carla-roach/blob/NeilBranch0/run/train_rl_NeilBranch0.sh ):

(carla) nsambhu@SAMBHU19:~/github/carla-roach$ run/train_rl_NeilBranch0.sh>out.txt
wandb: ⭐️ View project at https://wandb.ai/neilsambhu/carla-roach-outputs_2022-06-15_18-51-04
wandb: 🚀 View run at https://wandb.ai/neilsambhu/carla-roach-outputs_2022-06-15_18-51-04/runs/2i4242s0
run/train_rl_NeilBranch0.sh: line 3: 96849 Segmentation fault      (core dumped) python -u train_rl_NeilBranch0.py agent.ppo.wb_run_path=null wb_project=train_rl_experts wb_name=roach agent/ppo/policy=xtma_beta agent.ppo.training.kwargs.explore_coef=0.05 carla_sh_path=${CARLA_ROOT}/CarlaUE4.sh
 PYTHON_RETURN=139!!! Start Over!!!
Neil start here 1
Neil start here 1
[2022-06-15 18:51:05,629][utils.server_utils][INFO] - Kill Carla Servers!
Neil left here 1
Neil start here 2
cfg.train_envs [{'env_id': 'Endless-v0', 'env_configs': {'carla_map': 'Town01', 'num_zombie_vehicles': [0, 150], 'num_zombie_walkers': [0, 300], 'weather_group': 'dynamic_1.0'}, 'gpu': [0]}]
[2022-06-15 18:51:06,648][utils.server_utils][INFO] - Kill Carla Servers!
[2022-06-15 18:51:06,648][utils.server_utils][INFO] - CUDA_VISIBLE_DEVICES=0 bash /opt/carla-simulator/CarlaUE4.sh -fps=10 -quality-level=Epic -carla-rpc-port=2000
Neil left here 2
Neil start here 3
Neil left here 3
Neil start here 4
Neil left here 4
Neil start here 5
Neil left here 5
Neil start here 6.0
cfg.agent[agent_name] {'entry_point': 'agents.rl_birdview.rl_birdview_agent:RlBirdviewAgent', 'wb_run_path': None, 'wb_ckpt_step': None, 'env_wrapper': {'entry_point': 'agents.rl_birdview.utils.rl_birdview_wrapper:RlBirdviewWrapper', 'kwargs': {'input_states': ['control', 'vel_xy'], 'acc_as_action': True}}, 'policy': {'entry_point': 'agents.rl_birdview.models.ppo_policy:PpoPolicy', 'kwargs': {'policy_head_arch': [256, 256], 'value_head_arch': [256, 256], 'features_extractor_entry_point': 'agents.rl_birdview.models.torch_layers:XtMaCNN', 'features_extractor_kwargs': {'states_neurons': [256, 256]}, 'distribution_entry_point': 'agents.rl_birdview.models.distributions:BetaDistribution', 'distribution_kwargs': {'dist_init': None}}}, 'training': {'entry_point': 'agents.rl_birdview.models.ppo:PPO', 'kwargs': {'learning_rate': 1e-05, 'n_steps_total': 12288, 'batch_size': 256, 'n_epochs': 20, 'gamma': 0.99, 'gae_lambda': 0.9, 'clip_range': 0.2, 'clip_range_vf': None, 'ent_coef': 0.01, 'explore_coef': 0.05, 'vf_coef': 0.5, 'max_grad_norm': 0.5, 'target_kl': 0.01, 'update_adv': False, 'lr_schedule_step': 8}}, 'obs_configs': {'birdview': {'module': 'birdview.chauffeurnet', 'width_in_pixels': 192, 'pixels_ev_to_bottom': 40, 'pixels_per_meter': 5.0, 'history_idx': [-16, -11, -6, -1], 'scale_bbox': True, 'scale_mask_col': 1.0}, 'speed': {'module': 'actor_state.speed'}, 'control': {'module': 'actor_state.control'}, 'velocity': {'module': 'actor_state.velocity'}}}
cfg.agent[agent_name].entry_point agents.rl_birdview.rl_birdview_agent:RlBirdviewAgent
Neil 6.1
type(AgentClass) <class 'type'>
AgentClass <class 'agents.rl_birdview.rl_birdview_agent.RlBirdviewAgent'>
Neil 6.2
Neil 6.3
Neil left here 6.0
Neil start here 7
Neil left here 7
Neil start here 8
Neil left here 8
[2022-06-15 18:51:11,703][__main__][INFO] - making port 2000
calling registration.py > make(id, **kwargs)
Neil start here 100
Neil left here 100
Neil start here 200
Neil left here 200
Neil /home/nsambhu/github/carla-roach/agents/rl_birdview/rl_birdview_agent.py:256
trainable parameters: 1.53M
Neil /home/nsambhu/github/carla-roach/agents/rl_birdview/models/ppo.py:216
n_epoch: 0, num_timesteps: 12288
n_epoch: 1, num_timesteps: 24576
Vehicle agent not added to the crowd by some problem! (repeated 56 times)
n_epoch: 2, num_timesteps: 36864
n_epoch: 3, num_timesteps: 49152
n_epoch: 4, num_timesteps: 61440
n_epoch: 5, num_timesteps: 73728
n_epoch: 6, num_timesteps: 86016
n_epoch: 7, num_timesteps: 98304
n_epoch: 8, num_timesteps: 110592
[2022-06-15 21:30:12,755][carla_gym.core.zombie_walker.zombie_walker_handler][WARNING] - Carla/Maps/Town01: Spawning zombie walkers max trial 10 reached! spawned/to_spawn: 119/120

I don't know why epoch 1 has an error, epochs 2-7 work, and epoch 8 ends with a fatal error. When I try to run training again without deleting the outputs/checkpoint.txt file, I get the following error:

Traceback (most recent call last):
  File "train_rl_NeilBranch0.py", line 135, in main
    wb_callback = WandbCallback(cfg, env)
  File "/home/nsambhu/github/carla-roach/agents/rl_birdview/utils/wandb_callback.py", line 23, in __init__
    wandb.config.update(OmegaConf.to_container(cfg))
  File "/home/nsambhu/anaconda3/envs/carla/lib/python3.7/site-packages/wandb/sdk/wandb_config.py", line 178, in update
    sanitized = self._update(d, allow_val_change)
  File "/home/nsambhu/anaconda3/envs/carla/lib/python3.7/site-packages/wandb/sdk/wandb_config.py", line 172, in _update
    parsed_dict, allow_val_change, ignore_keys=locked_keys
  File "/home/nsambhu/anaconda3/envs/carla/lib/python3.7/site-packages/wandb/sdk/wandb_config.py", line 231, in _sanitize_dict
    k, v = self._sanitize(k, v, allow_val_change)
  File "/home/nsambhu/anaconda3/envs/carla/lib/python3.7/site-packages/wandb/sdk/wandb_config.py", line 257, in _sanitize
    ).format(key, self._items[key], val)
wandb.sdk.lib.config_util.ConfigError: Attempted to change value of key "actors" from {'hero': {'coach': None, 'driver': 'ppo', 'reward': {'entry_point': 'reward.valeo_action:ValeoAction'}, 'terminal': {'kwargs': {'max_time': 300, 'no_run_rl': False, 'no_run_stop': False, 'no_collision': True}, 'entry_point': 'terminal.leaderboard_dagger:LeaderboardDagger'}}} to {'hero': {'agent': 'ppo', 'reward': {'entry_point': 'reward.valeo_action:ValeoAction', 'kwargs': {}}, 'terminal': {'entry_point': 'terminal.valeo_no_det_px:ValeoNoDetPx', 'kwargs': {}}}}
If you really want to do this, pass allow_val_change=True to config.update()

I changed wandb.init in ( https://github.com/neilsambhu/carla-roach/blob/NeilBranch0/agents/rl_birdview/utils/wandb_callback.py#L23 ) to use allow_val_change=True, but I get the same error.

How can I train RL experts?

neilsambhu commented 2 years ago

I deleted the outputs/checkpoint.txt file and re-ran the train_rl_NeilBranch0.sh script. I see the wandb URL changed from

wandb: ⭐️ View project at https://wandb.ai/neilsambhu/carla-roach-outputs_2022-06-15_18-51-04
wandb: 🚀 View run at https://wandb.ai/neilsambhu/carla-roach-outputs_2022-06-15_18-51-04/runs/2i4242s0

to

wandb: ⭐️ View project at https://wandb.ai/neilsambhu/train_rl_experts
wandb: 🚀 View run at https://wandb.ai/neilsambhu/train_rl_experts/runs/29472vhr


I assume this means allow_val_change=True has taken effect. I will mark this issue as resolved once I confirm I can train across multiple towns.
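
For reference, this behaviour is consistent with train_rl.py persisting the wandb run path in outputs/checkpoint.txt and resuming that run on the next launch, so deleting the file forces a fresh run. The snippet below is only an assumed illustration of that pattern (the file name outputs/checkpoint.txt comes from this thread; the function name and the rest are hypothetical, not the actual code in train_rl.py):

    import os

    CHECKPOINT_FILE = 'outputs/checkpoint.txt'  # file name taken from this thread

    def resolve_wb_run_path(default_run_path=None):
        # Hypothetical helper: resume the wandb run recorded by a previous session
        # if the checkpoint file exists, otherwise start a brand-new run.
        if os.path.isfile(CHECKPOINT_FILE):
            with open(CHECKPOINT_FILE) as f:
                return f.read().strip()
        return default_run_path

With the file deleted, the script falls through to a new run, which matches the switch to the wb_project=train_rl_experts override passed in train_rl_NeilBranch0.sh.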

neilsambhu commented 2 years ago

I resolved the checkpoint issue by adding allow_val_change=True to the first call to wandb.config.update(): https://github.com/neilsambhu/carla-roach/blob/NeilBranch0/agents/rl_birdview/utils/wandb_callback.py#L32 . I should now be able to resume training and train across multiple towns, one town per training session.
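
A minimal sketch of that change (the surrounding code in wandb_callback.py may differ; the project/run names and the cfg contents here are placeholders):

    import wandb
    from omegaconf import OmegaConf

    cfg = OmegaConf.create({'actors': {'hero': {'agent': 'ppo'}}})  # placeholder config

    wandb.init(project='train_rl_experts', name='roach')
    # allow_val_change=True lets wandb overwrite keys (e.g. "actors") that a resumed
    # run had already logged with different values, avoiding the ConfigError above.
    wandb.config.update(OmegaConf.to_container(cfg), allow_val_change=True)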