wenet-e2e / wespeaker

Research and Production Oriented Speaker Verification, Recognition and Diarization Toolkit
Apache License 2.0

Got errors when running the training part of run.sh #277

Closed. NathanJHLee closed this issue 6 months ago.

NathanJHLee commented 6 months ago

Hi, my name is Nathan. I tried to run the training part following "wespeaker/examples/voxceleb/v2" and ran into a few errors:

  1. [Errno 28] No space left on device.
  2. [error] checkpoint is null !

My environment is a single node with 8 x A100 (80 GB) GPUs, and there is plenty of local disk space.

I am using a Docker container with torch 1.12.1, torchaudio 0.12.1, and torchnet 0.0.4.

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_May__3_19:15:13_PDT_2021
Cuda compilation tools, release 11.3, V11.3.109
Build cuda_11.3.r11.3/compiler.29920130_0

Please check my error log below. Thank you.

(wespeaker) [asr@e7bcf3a85e2c v2]$ bash run.sh
Start training ...
/home/asr/miniconda3/envs/wespeaker/lib/python3.9/site-packages/joblib/_multiprocessing_helpers.py:46: UserWarning: [Errno 28] No space left on device. joblib will operate in serial mode
  warnings.warn('%s. joblib will operate in serial mode' % (e,))
(the warning above is printed 8 times, once per spawned process)
[warning] exp/CAMPPlus-TSTP-emb512-fbank80-num_frms200-aug0.6-spTrue-saFalse-ArcMargin-SGD-epoch150/models already exists !!!
[error] checkpoint is null !
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 240349 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 240350 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 240351 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 240352 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 240353 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 240354 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 240355 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 240348) of binary: /home/asr/miniconda3/envs/wespeaker/bin/python
Traceback (most recent call last):
  File "/home/asr/miniconda3/envs/wespeaker/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.12.1', 'console_scripts', 'torchrun')())
  File "/home/asr/miniconda3/envs/wespeaker/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/asr/miniconda3/envs/wespeaker/lib/python3.9/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/home/asr/miniconda3/envs/wespeaker/lib/python3.9/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/home/asr/miniconda3/envs/wespeaker/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/asr/miniconda3/envs/wespeaker/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

wespeaker/bin/train.py FAILED

Failures:

------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-02-16_09:42:09
  host      : e7bcf3a85e2c
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 240348)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
NathanJHLee commented 6 months ago

Sorry, I didn't check the shm volume. I solved this error by creating a new container with the option "--shm-size=100gb" added. It works fine. :D
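For anyone hitting the same [Errno 28] warnings, here is a minimal sketch of the workaround described above. The image name and mount path are placeholders, not taken from this issue:

# Recreate the container with a larger shared-memory segment (/dev/shm);
# joblib and PyTorch DataLoader workers use it for inter-process communication.
docker run --gpus all --shm-size=100gb -it \
    -v /path/to/wespeaker:/workspace/wespeaker \
    <your-wespeaker-image> bash

# Inside the container, verify the new /dev/shm size before rerunning run.sh:
df -h /dev/shm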

NathanJHLee commented 6 months ago

I will close this issue.