shuxjweb opened 11 months ago
I have run several experiments with finetune.py:
Failures:
This seems to be an environment issue rather than a code issue; I have never encountered it before.
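If it helps narrow things down, a quick environment dump on both a working machine and the failing one can confirm whether versions differ. A minimal sketch using only standard PyTorch calls (nothing repo-specific):

```python
# Environment dump for comparing machines; all standard PyTorch APIs.
import torch

print("torch:", torch.__version__)
print("CUDA (build):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
for i in range(torch.cuda.device_count()):
    print(f"GPU {i}:", torch.cuda.get_device_name(i))
```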
Thank you for your reply. This is strange to me: on a single machine with multiple GPUs, training runs normally with the image size set to 224, but it fails as soon as I change it to 384. I have checked the GPU memory: because the batch size is small (2), memory usage at 384 stays below half, so insufficient GPU memory should not be the cause.
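For reference, this is roughly how I checked the memory headroom; a minimal sketch using standard torch.cuda calls, run right after a training step at image size 384 with batch size 2:

```python
# Report how much of each GPU is actually in use after a step at 384.
import torch

for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)  # bytes free / total on device i
    used = total - free
    print(f"GPU {i}: {used / 2**30:.1f} GiB used of {total / 2**30:.1f} GiB "
          f"({used / total:.0%})")
```

On every GPU the usage stays below 50%, which is why I ruled out an out-of-memory cause.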
Hi, I encountered an error while running finetune.py and the program died. What could be the reason?
python -m torch.distributed.run --nproc_per_node=8 finetune.py \
    --model-type ram_plus \
    --config ram/configs/finetune.yaml \
    --checkpoint /models/RAM/ram_plus_swin_large_14m.pth \
    --output-dir /logs/RAM/20231205_ramplus_coco_finetune
(p38t20) [root@ts-80e08ce490704c3aa7d3ca229319b5a9-launcher /recognize_anything]# sh start.sh
WARNING:__main__:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
2023-12-06 15:56:54.766918: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
[the oneDNN notice above is printed once per rank, 8 times in total]
2023-12-06 15:56:54.819133: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
[printed once per rank, 8 times in total]
2023-12-06 15:56:55.554178: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[printed once per rank, 8 times in total]
| distributed init (rank 0, word 8): env://
[ranks 0-7 all report distributed init via env://, in arbitrary order]
Creating dataset
loading /data/img_txt/recognize-anything-dataset-14m/coco_train_rmcocodev_ram.json
number of training samples: 547741
Creating model
load from: /models/RAM/ram_plus_swin_large_14m.pth
Creating pretrained CLIP model
Creating RAM model
/models/RAM/ram_plus_swin_large_14m.pth
load checkpoint from /models/RAM/ram_plus_swin_large_14m.pth
vit: swin_l
Start training
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 46292 closing signal SIGTERM
[the same SIGTERM warning is logged for processes 46293, 46294, 46295, 46297, 46298, 46299]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 4 (pid: 46296) of binary: /root/miniconda3/envs/p38t20/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/p38t20/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/miniconda3/envs/p38t20/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/root/miniconda3/envs/p38t20/lib/python3.8/site-packages/torch/distributed/run.py", line 798, in <module>
    main()
  File "/root/miniconda3/envs/p38t20/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/root/miniconda3/envs/p38t20/lib/python3.8/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/root/miniconda3/envs/p38t20/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/root/miniconda3/envs/p38t20/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/p38t20/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
finetune.py FAILED
Failures:
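Note the exit code in the log above: exitcode -11 means rank 4 was killed by signal 11 (SIGSEGV), which is why the launcher reports no Python traceback from the failing worker. One way to surface more information, assuming a couple of lines can be added near the top of finetune.py, is the standard-library faulthandler:

```python
# Hypothetical addition near the top of finetune.py: on a fatal signal such as
# SIGSEGV (signal 11, the source of the exitcode -11 above), dump the Python
# stack of every thread to stderr before the process dies.
import faulthandler

faulthandler.enable()
```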