weipu-zhang / STORM


Broken pipe #2

Closed robotzheng closed 4 months ago

robotzheng commented 4 months ago

./train.sh
Namespace(n='MsPacman-life_done-wm_2L512D8H-100k-seed1', seed=1, config_path='config_files/STORM.yaml', env_name='ALE/MsPacman-v5', trajectory_path='D_TRAJ/MsPacman.pkl')
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
Current env: ALE/MsPacman-v5
  0%| | 0/102000 [00:00<?, ?it/s]
Saving model at total steps 0
  1%|▊ | 985/102000 [00:01<01:54, 882.92it/s]
./train.sh: line 7: 38788 Floating point exception python -u train.py -n "${env_name}-life_done-wm_2L512D8H-100k-seed1" -seed 1 -config_path "config_files/STORM.yaml" -env_name "ALE/${env_name}-v5" -trajectory_path "D_TRAJ/${env_name}.pkl"
(safe-rlhf) oppoer@task-20240105100221-21140:/home/notebook/code/personal/80306170/AGI/STORM$ Process Worker-0:
Traceback (most recent call last):
  File "/opt/conda/envs/safe-rlhf/lib/python3.9/site-packages/gymnasium/vector/async_vector_env.py", line 626, in _worker_shared_memory
    command, data = pipe.recv()
  File "/opt/conda/envs/safe-rlhf/lib/python3.9/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/opt/conda/envs/safe-rlhf/lib/python3.9/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv(4)
  File "/opt/conda/envs/safe-rlhf/lib/python3.9/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/envs/safe-rlhf/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/conda/envs/safe-rlhf/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/envs/safe-rlhf/lib/python3.9/site-packages/gymnasium/vector/async_vector_env.py", line 685, in _worker_shared_memory
    pipe.send((None, False))
  File "/opt/conda/envs/safe-rlhf/lib/python3.9/multiprocessing/connection.py", line 206, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/opt/conda/envs/safe-rlhf/lib/python3.9/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
  File "/opt/conda/envs/safe-rlhf/lib/python3.9/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

weipu-zhang commented 4 months ago

I haven't run into this issue before, across several different environments and devices. It looks like a problem with gymnasium's vec_env.

Could you provide more information about your hardware and your PyTorch and gymnasium versions? Also, do you have any idea what caused the Floating point exception in your log?
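
For context on how the two errors in the log may relate: the shell prints "Floating point exception" when a process is killed by SIGFPE, so the main `train.py` process appears to have died first. The `EOFError` followed by `BrokenPipeError` in `Worker-0` is what a pipe endpoint reports once the other end has gone away, which suggests the worker traceback is a symptom rather than the cause. The sketch below reproduces that pattern with the standard library only. It closes one end of a `multiprocessing.Pipe` to stand in for the dead parent; the function name is hypothetical and this is not gymnasium's actual worker code.

```python
from multiprocessing import Pipe


def worker_side_after_parent_exit():
    """Return the names of the exceptions a worker sees once its parent is gone."""
    parent_conn, worker_conn = Pipe()
    # Closing the parent's end stands in for the parent process being killed.
    parent_conn.close()

    raised = []
    try:
        # The worker waits for the next command, as a vector-env worker would.
        worker_conn.recv()
    except EOFError as exc:
        raised.append(type(exc).__name__)
        try:
            # Reporting the failure back also fails: nobody is left to read it.
            worker_conn.send((None, False))
        except OSError as send_exc:
            raised.append(type(send_exc).__name__)
    finally:
        worker_conn.close()
    return raised


if __name__ == "__main__":
    print(worker_side_after_parent_exit())
```

On Linux the second exception is typically `BrokenPipeError`, matching the log above. If this reading is right, the thing to track down is the SIGFPE in the main process, not the pipe error.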

robotzheng commented 4 months ago

python 3.9.18
gymnasium 0.29.1
torch 2.1.1+cu118

robotzheng commented 4 months ago

Freeway also crashes, but it gets through more iterations (1019).
./train.sh
Namespace(n='Freeway-life_done-wm_2L512D8H-100k-seed1', seed=1, config_path='config_files/STORM.yaml', env_name='ALE/Freeway-v5', trajectory_path='D_TRAJ/Freeway.pkl')
A.L.E: Arcade Learning Environment (version 0.8.1+53f58b7)
[Powered by Stella]
Current env: ALE/Freeway-v5
  0%| | 0/102000 [00:00<?, ?it/s]
Saving model at total steps 0
  1%|▉ | 1019/102000 [00:08<14:06, 119.23it/s]
./train.sh: line 7: 48677 Floating point exception python -u train.py -n "${env_name}-life_done-wm_2L512D8H-100k-seed1" -seed 1 -config_path "config_files/STORM.yaml" -env_name "ALE/${env_name}-v5" -trajectory_path "D_TRAJ/${env_name}.pkl"
Process Worker-0:
(safe-rlhf) oppoer@task-20240105100221-21140:/home/notebook/code/personal/80306170/AGI/STORM$ Traceback (most recent call last):
  File "/opt/conda/envs/safe-rlhf/lib/python3.9/site-packages/gymnasium/vector/async_vector_env.py", line 626, in _worker_shared_memory
    command, data = pipe.recv()
  File "/opt/conda/envs/safe-rlhf/lib/python3.9/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/opt/conda/envs/safe-rlhf/lib/python3.9/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv(4)
  File "/opt/conda/envs/safe-rlhf/lib/python3.9/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/envs/safe-rlhf/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/conda/envs/safe-rlhf/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/envs/safe-rlhf/lib/python3.9/site-packages/gymnasium/vector/async_vector_env.py", line 685, in _worker_shared_memory
    pipe.send((None, False))
  File "/opt/conda/envs/safe-rlhf/lib/python3.9/multiprocessing/connection.py", line 206, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/opt/conda/envs/safe-rlhf/lib/python3.9/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
  File "/opt/conda/envs/safe-rlhf/lib/python3.9/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

weipu-zhang commented 4 months ago

I can't tell what the solution is from the information you've given. The versions look correct to me.

However, while the training step count is below 1024, STORM is still in its warmup phase and the model is neither trained nor evaluated. So the issue is most likely related to gymnasium and your hardware (perhaps an incompatibility) rather than to the algorithm. I suggest removing or commenting out all the model-related parts and checking whether the vec_env works as expected.
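
One way to follow the suggestion above is a model-free smoke test that only steps the vectorized environment with random actions. The sketch below is a minimal version, assuming gymnasium 0.29.x with the ALE environments installed. The helper names `make_async_envs` and `run_random_steps` are hypothetical, and the environment is built with plain `gymnasium.make`, not STORM's own wrapper stack, so a clean run here does not rule out a problem in those wrappers.

```python
from typing import Any


def make_async_envs(env_id: str, num_envs: int = 1) -> Any:
    """Build a bare AsyncVectorEnv (assumption: default settings, no STORM wrappers)."""
    # Imported lazily so the stepping loop below can be exercised without gymnasium.
    import gymnasium
    from gymnasium.vector import AsyncVectorEnv

    def make_one():
        return gymnasium.make(env_id)

    return AsyncVectorEnv([make_one for _ in range(num_envs)])


def run_random_steps(envs: Any, num_steps: int, seed: int = 1) -> int:
    """Step a vector env with random actions and count finished episodes."""
    envs.reset(seed=seed)
    episodes_finished = 0
    for _ in range(num_steps):
        # Sample one random action per sub-environment; no model is involved.
        actions = envs.action_space.sample()
        _, _, terminated, truncated, _ = envs.step(actions)
        episodes_finished += sum(
            bool(t) or bool(u) for t, u in zip(terminated, truncated)
        )
    return episodes_finished
```

To try it, call `envs = make_async_envs("ALE/MsPacman-v5")`, then `run_random_steps(envs, 2000)`, then `envs.close()`, which takes the run past the point where both logs above stop. If this loop also dies with a Floating point exception, the model code can be ruled out; if it runs cleanly, the next step would be to add STORM's wrappers back one at a time.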

I'll close this thread for now. If you have other concerns, or if anyone else runs into the same problem, feel free to reopen it.