shuxjweb opened 11 months ago
I have run several experiments with finetune.py:
Failures:
This seems to be an environment issue rather than a code issue; I have never encountered it before.
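If it helps narrow things down, a quick environment dump on both a working machine and the failing one can confirm whether versions differ. A minimal sketch using only standard PyTorch calls (nothing repo-specific):

```python
# Environment dump for comparing machines; all standard PyTorch APIs.
import torch

print("torch:", torch.__version__)
print("CUDA (build):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
for i in range(torch.cuda.device_count()):
    print(f"GPU {i}:", torch.cuda.get_device_name(i))
```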
Thank you for your reply. This is strange to me: on a single machine with multiple GPUs, training runs normally with the image size set to 224, but it fails as soon as I change it to 384. I have checked the GPU memory: because the batch size is small (2), memory usage at 384 stays below half, so insufficient GPU memory should not be the cause.
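For reference, this is roughly how I checked the memory headroom; a minimal sketch using standard torch.cuda calls, run right after a training step at image size 384 with batch size 2:

```python
# Report how much of each GPU is actually in use after a step at 384.
import torch

for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)  # bytes free / total on device i
    used = total - free
    print(f"GPU {i}: {used / 2**30:.1f} GiB used of {total / 2**30:.1f} GiB "
          f"({used / total:.0%})")
```

On every GPU the usage stays below 50%, which is why I ruled out an out-of-memory cause.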
Hi, I encountered an error while running finetune.py and the program died. What could be the reason?
python -m torch.distributed.run --nproc_per_node=8 finetune.py \
    --model-type ram_plus \
    --config ram/configs/finetune.yaml \
    --checkpoint /models/RAM/ram_plus_swin_large_14m.pth \
    --output-dir /logs/RAM/20231205_ramplus_coco_finetune
(p38t20) [root@ts-80e08ce490704c3aa7d3ca229319b5a9-launcher /recognize_anything]# sh start.sh
WARNING:__main__:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
2023-12-06 15:56:54.766918: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
[the oneDNN notice above is printed once per rank, 8 times in total]
2023-12-06 15:56:54.819133: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
[printed once per rank, 8 times in total]
2023-12-06 15:56:55.554178: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[printed once per rank, 8 times in total]
| distributed init (rank 0, word 8): env://
[ranks 0-7 all report distributed init via env://, in arbitrary order]
Creating dataset
loading /data/img_txt/recognize-anything-dataset-14m/coco_train_rmcocodev_ram.json
number of training samples: 547741
Creating model
load from: /models/RAM/ram_plus_swin_large_14m.pth
Creating pretrained CLIP model
Creating RAM model
/models/RAM/ram_plus_swin_large_14m.pth
load checkpoint from /models/RAM/ram_plus_swin_large_14m.pth
vit: swin_l
Start training
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 46292 closing signal SIGTERM
[the same SIGTERM warning is logged for processes 46293, 46294, 46295, 46297, 46298, 46299]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 4 (pid: 46296) of binary: /root/miniconda3/envs/p38t20/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/p38t20/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/miniconda3/envs/p38t20/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/root/miniconda3/envs/p38t20/lib/python3.8/site-packages/torch/distributed/run.py", line 798, in <module>
    main()
  File "/root/miniconda3/envs/p38t20/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/root/miniconda3/envs/p38t20/lib/python3.8/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/root/miniconda3/envs/p38t20/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/root/miniconda3/envs/p38t20/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/p38t20/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
finetune.py FAILED
Failures:
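Note the exit code in the log above: exitcode -11 means rank 4 was killed by signal 11 (SIGSEGV), which is why the launcher reports no Python traceback from the failing worker. One way to surface more information, assuming a couple of lines can be added near the top of finetune.py, is the standard-library faulthandler:

```python
# Hypothetical addition near the top of finetune.py: on a fatal signal such as
# SIGSEGV (signal 11, the source of the exitcode -11 above), dump the Python
# stack of every thread to stderr before the process dies.
import faulthandler

faulthandler.enable()
```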