yerfor / GeneFace

GeneFace: Generalized and High-Fidelity 3D Talking Face Synthesis; ICLR 2023; Official code
MIT License

Step 3: ran out of GPU memory while training the PostNet model. Which parameter can I change? #192

Closed · tailangjun closed this 10 months ago

tailangjun commented 10 months ago

Running the command python tasks/run.py --config=egs/datasets/videos/Macron/lm3d_postnet_sync.yaml --exp_name=Macron/lm3d_postnet_sync fails with: RuntimeError: DataLoader worker (pid 8187) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.

In my experience the usual fix is to lower batch_size, but I don't know which parameter in which config file controls it. If anyone knows, please help. Thanks!

tailangjun commented 10 months ago

Full error output:

ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm). | 0/10 [00:00<?, ?step/s]
Traceback (most recent call last):
  File "/opt/anaconda3/envs/geneface/lib/python3.9/multiprocessing/queues.py", line 244, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/opt/anaconda3/envs/geneface/lib/python3.9/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "/opt/anaconda3/envs/geneface/lib/python3.9/site-packages/torch/multiprocessing/reductions.py", line 355, in reduce_storage
    metadata = storage._share_filename_cpu_()
RuntimeError: unable to write to file : No space left on device (28)

Valid:  10%|███████████████▎ | 1/10 [00:01<00:13, 1.49s/step]
Validation results@6000: {'total_loss': 0.1895, 'mse': 0.0317, 'sync': 0.1577} | 1/10 [00:01<00:12, 1.42s/step]
08/31 09:09:06 PM Epoch 00104@6000: saving model to checkpoints/Macron/lm3d_postnet_sync/model_ckpt_steps_6000.ckpt
6000step [00:13, 2.67step/s, adv=0.219, disc_fake_loss=0.34, disc_neg_conf=0.561, disc_pos_conf=0.577, disc_true_loss=0.189, lr_0=0.000473, lr_1=0.0001, mse=0.0242, sync=0.0759]

Traceback (most recent call last):
  File "/home/tailangjun/Documents/AIGenHuman/2DHuman/GeneFace/tasks/run.py", line 19, in <module>
    run_task()
  File "/home/tailangjun/Documents/AIGenHuman/2DHuman/GeneFace/tasks/run.py", line 14, in run_task
    task_cls.start()
  File "/home/tailangjun/Documents/AIGenHuman/2DHuman/GeneFace/utils/commons/base_task.py", line 251, in start
    trainer.fit(cls)
  File "/home/tailangjun/Documents/AIGenHuman/2DHuman/GeneFace/utils/commons/trainer.py", line 122, in fit
    self.run_single_process(self.task)
  File "/home/tailangjun/Documents/AIGenHuman/2DHuman/GeneFace/utils/commons/trainer.py", line 186, in run_single_process
    self.train()
  File "/home/tailangjun/Documents/AIGenHuman/2DHuman/GeneFace/utils/commons/trainer.py", line 285, in train
    self.run_evaluation()
  File "/home/tailangjun/Documents/AIGenHuman/2DHuman/GeneFace/utils/commons/trainer.py", line 201, in run_evaluation
    self.save_checkpoint(epoch=self.current_epoch, logs=eval_results)
  File "/home/tailangjun/Documents/AIGenHuman/2DHuman/GeneFace/utils/commons/trainer.py", line 438, in save_checkpoint
    self._atomic_save(ckpt_path)
  File "/home/tailangjun/Documents/AIGenHuman/2DHuman/GeneFace/utils/commons/trainer.py", line 457, in _atomic_save
    torch.save(checkpoint, tmp_path, _use_new_zipfile_serialization=False)
  File "/opt/anaconda3/envs/geneface/lib/python3.9/site-packages/torch/serialization.py", line 427, in save
    _legacy_save(obj, opened_file, pickle_module, pickle_protocol)
  File "/opt/anaconda3/envs/geneface/lib/python3.9/site-packages/torch/serialization.py", line 571, in _legacy_save
    storage._write_file(f, _should_read_directly(f), True, torch._utils._element_size(dtype))
  File "/opt/anaconda3/envs/geneface/lib/python3.9/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 8187) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.

tailangjun commented 10 months ago

The consensus online is that this means shared memory is insufficient. Here is the disk usage just before the crash:

Every 1.0s: df -h        ubuntu: Thu Aug 31 22:01:37 2023

Filesystem      Size  Used  Avail Use%  Mounted on
tmpfs           6.3G  4.9M  6.3G    1%  /run
/dev/nvme0n1p3  3.4T  2.0T  1.3T   60%  /
tmpfs            32G   30G  1.5G   96%  /dev/shm
tmpfs           5.0M  4.0K  5.0M    1%  /run/lock
/dev/nvme0n1p4  477M  7.0M  470M    2%  /boot/efi
tmpfs           6.3G   96K  6.3G    1%  /run/user/1001

The usage of /dev/shm kept creeping up bit by bit, so it's confirmed: /dev/shm filling up is the cause.

There are two ways to fix it:

  1. Increase the shm size. On a physical machine, edit /etc/fstab (see https://stackoverflow.com/questions/58804022/how-to-resize-dev-shm); if running inside Docker, pass --shm-size when starting the container. A sketch is given after this list.

  2. Reduce num_workers in the DataLoader, e.g.:

     dataloader = torch.utils.data.DataLoader(
         dataset,
         batch_size=16,
         shuffle=True,
         num_workers=0,
         pin_memory=True,
         collate_fn=dataset.collate_fn,
     )
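
For option 1, a minimal sketch of what resizing /dev/shm can look like on a bare-metal Linux host, plus the Docker equivalent. The 16G size below is only an illustrative value, not taken from this thread; pick whatever fits your machine:

    # enlarge /dev/shm for the running system (takes effect immediately, example size 16G)
    sudo mount -o remount,size=16G /dev/shm

    # make it persistent across reboots: add or adjust this line in /etc/fstab
    tmpfs  /dev/shm  tmpfs  defaults,size=16G  0  0

    # for Docker, set the container's shm size at container startup instead
    docker run --shm-size=16g <image> ...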
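
For option 2, setting num_workers=0 helps because the data is then loaded in the main process, so no tensors have to be handed between worker processes through /dev/shm. A further general PyTorch workaround, not mentioned in this thread and offered only as an assumption worth testing, is to switch the worker sharing strategy from shared memory to plain files:

    import torch.multiprocessing as mp

    # Workers will pass tensors to the main process via temporary files on disk
    # instead of /dev/shm; somewhat slower, but it avoids exhausting shared memory.
    mp.set_sharing_strategy('file_system')

Call this once near the top of the training script, before any DataLoader is created.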

husthzy commented 8 months ago

Is it normal for /dev/shm usage to keep growing until the program exits on its own? On my cloud server all 46G of it got used up.