Closed tailangjun closed 10 months ago
Detailed error output
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm). | 0/10 [00:00<?, ?step/s]
Traceback (most recent call last):
File "/opt/anaconda3/envs/geneface/lib/python3.9/multiprocessing/queues.py", line 244, in _feed
obj = _ForkingPickler.dumps(obj)
File "/opt/anaconda3/envs/geneface/lib/python3.9/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
File "/opt/anaconda3/envs/geneface/lib/python3.9/site-packages/torch/multiprocessing/reductions.py", line 355, in reduce_storage
metadata = storage._share_filename_cpu_()
RuntimeError: unable to write to file : No space left on device (28)
Valid: 10%|███████████████▎ | 1/10 [00:01<00:13, 1.49s/step]
| Validation results@6000: {'total_loss': 0.1895, 'mse': 0.0317, 'sync': 0.1577} | 1/10 [00:01<00:12, 1.42s/step]
08/31 09:09:06 PM Epoch 00104@6000: saving model to checkpoints/Macron/lm3d_postnet_sync/model_ckpt_steps_6000.ckpt
6000step [00:13, 2.67step/s, adv=0.219, disc_fake_loss=0.34, disc_neg_conf=0.561, disc_pos_conf=0.577, disc_true_loss=0.189, lr_0=0.000473, lr_1=0.0001, mse=0.0242, sync=0.0759]
Traceback (most recent call last):
File "/home/tailangjun/Documents/AIGenHuman/2DHuman/GeneFace/tasks/run.py", line 19, in
According to reports online, this happens when shared memory runs out. Here is the disk usage just before the crash:
Every 1.0s: df -h    ubuntu: Thu Aug 31 22:01:37 2023
Filesystem      Size  Used Avail Use% Mounted on
tmpfs           6.3G  4.9M  6.3G   1% /run
/dev/nvme0n1p3  3.4T  2.0T  1.3T  60% /
tmpfs            32G   30G  1.5G  96% /dev/shm
tmpfs           5.0M  4.0K  5.0M   1% /run/lock
/dev/nvme0n1p4  477M  7.0M  470M   2% /boot/efi
tmpfs           6.3G   96K  6.3G   1% /run/user/1001
Usage of /dev/shm climbs bit by bit during the run until it is almost full, which confirms that /dev/shm is the cause.
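For anyone who wants to log this growth from Python instead of watching `df -h`, here is a minimal sketch using only the standard library. The function names `shm_usage` and `watch_shm` and the 90% warning threshold are my own choices for illustration, not part of GeneFace or PyTorch.

```python
import shutil
import time


def shm_usage(path="/dev/shm"):
    """Return (used_bytes, total_bytes, used_fraction) for the filesystem at path."""
    usage = shutil.disk_usage(path)
    fraction = usage.used / usage.total if usage.total else 0.0
    return usage.used, usage.total, fraction


def watch_shm(path="/dev/shm", interval_s=1.0, samples=5, warn_at=0.9):
    """Print usage every interval_s seconds and return the sampled fractions."""
    history = []
    for _ in range(samples):
        used, total, fraction = shm_usage(path)
        # warn_at is an arbitrary threshold chosen for this sketch
        flag = "  <-- nearly full" if fraction >= warn_at else ""
        print(f"{path}: {used / 2**30:.2f} GiB / {total / 2**30:.2f} GiB ({fraction:.0%}){flag}")
        history.append(fraction)
        time.sleep(interval_s)
    return history
```

Running `watch_shm(samples=600)` in a second terminal while training is in progress should show whether usage levels off or keeps rising.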
There are two ways to address this:
1. Increase the shared memory size. On a physical machine, edit /etc/fstab (see https://stackoverflow.com/questions/58804022/how-to-resize-dev-shm). With Docker, pass the --shm-size option when starting the container.
2. Change the num_workers argument of the DataLoader:
    dataloader = torch.utils.data.DataLoader(
        dataset,
        batch_size=16,
        shuffle=True,
        num_workers=0,
        pin_memory=True,
        collate_fn=dataset.collate_fn
    )
Is it normal for /dev/shm to keep growing until the program exits on its own? On my cloud server, all 46G of space was used up.
Command run:
python tasks/run.py --config=egs/datasets/videos/Macron/lm3d_postnet_sync.yaml --exp_name=Macron/lm3d_postnet_sync
Message:
RuntimeError: DataLoader worker (pid 8187) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.
In my past experience the fix is to reduce batch_size, but I don't know which parameter in which config file controls it. If anyone knows, please help me out. Thanks!
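To get a feel for how much batch_size and num_workers each matter, here is a rough back-of-the-envelope sketch. It assumes the only shared memory in use is the batches that workers have queued for the main process, and relies on the DataLoader default of prefetch_factor=2 batches per worker. The function name and the 4 MiB per-sample size are made up for illustration; the real per-sample size depends on the dataset.

```python
def estimate_inflight_shm_bytes(batch_size, bytes_per_sample, num_workers, prefetch_factor=2):
    """Rough upper bound on bytes of prefetched batches held in shared memory at once."""
    if num_workers == 0:
        # Batches are built in the main process, so nothing crosses a process boundary.
        return 0
    return num_workers * prefetch_factor * batch_size * bytes_per_sample


MIB = 2**20
sample_bytes = 4 * MIB  # hypothetical per-sample size, not measured on GeneFace data

for batch_size, num_workers in [(16, 4), (8, 4), (16, 0)]:
    estimate = estimate_inflight_shm_bytes(batch_size, sample_bytes, num_workers)
    print(f"batch_size={batch_size}, num_workers={num_workers}: ~{estimate / MIB:.0f} MiB")
```

If the usage observed in /dev/shm is far above an estimate like this, queued batches alone probably do not explain it, and a smaller batch_size would only delay the crash rather than prevent it.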