ymcui / Chinese-LLaMA-Alpaca

Chinese LLaMA & Alpaca large language models + local CPU/GPU training and deployment (Chinese LLaMA & Alpaca LLMs)
https://github.com/ymcui/Chinese-LLaMA-Alpaca/wiki
Apache License 2.0

Multi-GPU training on a single machine fails: torchrun runs normally with --nproc_per_node set to `2`, but errors out with any value greater than `2` #722

Closed. shibingli closed this issue 1 year ago.

shibingli commented 1 year ago

Pre-submission checklist

Issue type

Model training and fine-tuning

Base model

LLaMA-Plus-7B

Operating system

Linux

Detailed description of the problem

Training on one machine with 10 GPUs fails. torchrun runs normally with --nproc_per_node set to 2, but errors out once the value exceeds 2. Has anyone dealt with a similar problem? The dependency versions all look correct, yet the training command fails whenever more than 2 GPUs are requested.
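
As a hedged aside for isolating the failure: the sketch below is not part of the repo. It is a minimal NCCL sanity check, assuming only that the installed PyTorch has CUDA/NCCL support, meant to separate launcher/environment failures from run_clm_pt_with_peft.py itself.

# Hedged sketch, not from the repo: write a tiny distributed test script.
cat > /tmp/sanity_check.py <<'EOF'
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")      # torchrun provides the env:// rendezvous
local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun for each worker
torch.cuda.set_device(local_rank)
t = torch.ones(1, device=f"cuda:{local_rank}")
dist.all_reduce(t)                           # default op is SUM; expect world_size
print(f"rank {dist.get_rank()}/{dist.get_world_size()}: {t.item()}")
dist.destroy_process_group()
EOF

# If this also dies once --nproc_per_node exceeds 2, the training script is not at fault.
torchrun --nnodes 1 --nproc_per_node 8 /tmp/sanity_check.py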

# Training command

root@d91c734b9499:/opt/app# bash /data/agi/Chinese-LLaMA-Alpaca/scripts/training/run_pt.sh

# Contents of run_pt.sh

lr=2e-4
lora_rank=8
lora_alpha=32
lora_trainable="q_proj,v_proj,k_proj,o_proj,gate_proj,down_proj,up_proj"
modules_to_save="embed_tokens,lm_head"
lora_dropout=0.05

pretrained_model=/data/agi/LoRA/hf/7B_Llama_Plus
chinese_tokenizer_path=/data/agi/LoRA/hf/7B_Llama_Plus
dataset_dir=/data/agi/nh/text/txt
data_cache=/data/agi/nh/cache/txt
per_device_train_batch_size=1
per_device_eval_batch_size=1
gradient_accumulation_steps=8
output_dir=/data/agi/nh/model/txt

deepspeed_config_file=/data/agi/Chinese-LLaMA-Alpaca/scripts/training/ds_zero2_no_offload.json

export WANDB_DISABLED=true
export OMP_NUM_THREADS=1
#export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9

rm -Rf /data/agi/nh/cache/txt/*
rm -Rf /data/agi/nh/model/txt/*

# Runs normally with --nproc_per_node set to `2`; fails with any value greater than `2` (`7`, `8`, and `10` all fail).
torchrun --nnodes 1 --nproc_per_node 8 /data/agi/Chinese-LLaMA-Alpaca/scripts/training/run_clm_pt_with_peft.py \
    --deepspeed ${deepspeed_config_file} \
    --model_name_or_path ${pretrained_model} \
    --tokenizer_name_or_path ${chinese_tokenizer_path} \
    --dataset_dir ${dataset_dir} \
    --data_cache_dir ${data_cache} \
    --validation_split_percentage 0.001 \
    --per_device_train_batch_size ${per_device_train_batch_size} \
    --per_device_eval_batch_size ${per_device_eval_batch_size} \
    --do_train \
    --seed $RANDOM \
    --fp16 \
    --num_train_epochs 1 \
    --lr_scheduler_type cosine \
    --learning_rate ${lr} \
    --warmup_ratio 0.05 \
    --weight_decay 0.01 \
    --logging_strategy steps \
    --logging_steps 10 \
    --save_strategy steps \
    --save_total_limit 3 \
    --save_steps 200 \
    --gradient_accumulation_steps ${gradient_accumulation_steps} \
    --preprocessing_num_workers 8 \
    --block_size 512 \
    --output_dir ${output_dir} \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --lora_rank ${lora_rank} \
    --lora_alpha ${lora_alpha} \
    --trainable ${lora_trainable} \
    --modules_to_save ${modules_to_save} \
    --lora_dropout ${lora_dropout} \
    --torch_dtype float16 \
    --gradient_checkpointing \
    --ddp_find_unused_parameters False
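
One hedged note on the script above: the CUDA_VISIBLE_DEVICES export is commented out, so all 10 GPUs remain visible regardless of --nproc_per_node. If only a subset should be used, pinning the visible devices to the rank count is one option (the indices below are illustrative):

# Illustrative only: expose exactly as many GPUs as ranks being launched.
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
torchrun --nnodes 1 --nproc_per_node 8 /data/agi/Chinese-LLaMA-Alpaca/scripts/training/run_clm_pt_with_peft.py ...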

Memory info:

root@d91c734b9499:/opt/app# free -mh
               total        used        free      shared  buff/cache   available
Mem:           1.5Ti       3.7Gi       1.5Ti       135Mi       5.0Gi       1.5Ti
Swap:          4.0Gi          0B       4.0Gi
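
A hedged aside on these readings: host RAM is clearly not the constraint. The log below ends in signal 7 (SIGBUS), which inside a container frequently points at the shared-memory cap rather than main memory; Docker gives /dev/shm only 64 MB by default, and multi-worker data preprocessing can exhaust it. Checking and raising the cap might look like this (the sizes are illustrative):

# Inside the container: inspect the shared-memory mount actually available.
df -h /dev/shm

# When launching the container: raise the cap, or share the host IPC namespace.
docker run --gpus=all --shm-size=16g ...
# or
docker run --gpus=all --ipc=host ...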

Dependencies (required for code-related issues)

# All dependencies installed per the Wiki

root@d91c734b9499:/opt/app# pip list
Package            Version      Editable project location
------------------ ------------ --------------------------
accelerate         0.20.3
aiofiles           23.1.0
aiohttp            3.8.4
aiosignal          1.3.1
altair             5.0.1
anyio              3.7.1
appdirs            1.4.4
async-timeout      4.0.2
attrs              23.1.0
certifi            2022.12.7
charset-normalizer 2.1.1
click              8.1.3
cmake              3.25.0
contourpy          1.1.0
cycler             0.11.0
datasets           2.13.0
deepspeed          0.9.4
dill               0.3.6
docker-pycreds     0.4.0
exceptiongroup     1.1.1
fastapi            0.99.1
ffmpy              0.3.0
filelock           3.9.0
fonttools          4.40.0
frozenlist         1.3.3
fsspec             2023.6.0
gitdb              4.0.10
GitPython          3.1.31
gradio             3.35.2
gradio_client      0.2.7
h11                0.14.0
hjson              3.1.0
httpcore           0.17.3
httpx              0.24.1
huggingface-hub    0.15.1
idna               3.4
iniconfig          2.0.0
Jinja2             3.1.2
joblib             1.2.0
jsonschema         4.17.3
kiwisolver         1.4.4
latex2mathml       3.76.0
linkify-it-py      2.0.2
lit                15.0.7
Markdown           3.4.3
markdown-it-py     2.2.0
MarkupSafe         2.1.2
matplotlib         3.7.2
mdit-py-plugins    0.3.3
mdtex2html         1.2.0
mdurl              0.1.2
mpmath             1.2.1
multidict          6.0.4
multiprocess       0.70.14
networkx           3.0
ninja              1.11.1
numpy              1.24.1
orjson             3.9.1
packaging          23.1
pandas             2.0.2
pathtools          0.1.2
peft               0.3.0.dev0   /opt/app/envs/peft_13e53fc
Pillow             9.3.0
pip                23.1.2
pluggy             1.0.0
protobuf           4.23.3
psutil             5.9.5
py-cpuinfo         9.0.0
pyarrow            12.0.1
pydantic           1.10.9
pydub              0.25.1
Pygments           2.15.1
pyparsing          3.0.9
pyrsistent         0.19.3
pytest             7.3.2
python-dateutil    2.8.2
python-multipart   0.0.6
pytz               2023.3
PyYAML             6.0
regex              2023.6.3
requests           2.28.1
safetensors        0.3.1
scikit-learn       1.2.2
scipy              1.10.1
semantic-version   2.10.0
sentencepiece      0.1.99
sentry-sdk         1.25.1
setproctitle       1.3.2
setuptools         59.6.0
six                1.16.0
smmap              5.0.0
sniffio            1.3.0
starlette          0.27.0
sympy              1.11.1
threadpoolctl      3.1.0
tokenizers         0.13.3
tomli              2.0.1
toolz              0.12.0
torch              2.0.0+cu118
torchaudio         2.0.1+cu118
torchvision        0.15.1+cu118
tqdm               4.65.0
transformers       4.30.2
triton             2.0.0
typing_extensions  4.7.1
tzdata             2023.3
uc-micro-py        1.0.2
urllib3            1.26.13
uvicorn            0.22.0
wandb              0.15.4
websockets         11.0.3
xxhash             3.2.0
yarl               1.9.2
# nvidia-smi output:

[shibingli@loaclhost ~]$ sudo docker run --gpus=all --runtime=nvidia --rm -it -v /data/:/data/ -v /data/agi/Chinese-LLaMA-Alpaca-Docker/envs/:/opt/app/envs/ rl-agi:latest bash
[sudo] password for shibingli:

==========
== CUDA ==
==========

CUDA Version 11.8.0

Container image Copyright (c) 2016-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

root@d91c734b9499:/opt/app# nvidia-smi
Fri Jul  7 09:23:35 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A800 80GB PCIe          Off | 00000000:12:00.0 Off |                    0 |
| N/A   34C    P0              42W / 300W |     18MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A800 80GB PCIe          Off | 00000000:13:00.0 Off |                    0 |
| N/A   33C    P0              44W / 300W |     18MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A800 80GB PCIe          Off | 00000000:14:00.0 Off |                    0 |
| N/A   35C    P0              47W / 300W |     18MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A800 80GB PCIe          Off | 00000000:48:00.0 Off |                    0 |
| N/A   34C    P0              42W / 300W |     18MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA A800 80GB PCIe          Off | 00000000:49:00.0 Off |                    0 |
| N/A   34C    P0              43W / 300W |     18MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA A800 80GB PCIe          Off | 00000000:89:00.0 Off |                    0 |
| N/A   34C    P0              44W / 300W |     18MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA A800 80GB PCIe          Off | 00000000:8A:00.0 Off |                    0 |
| N/A   34C    P0              42W / 300W |     18MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA A800 80GB PCIe          Off | 00000000:C0:00.0 Off |                    0 |
| N/A   34C    P0              45W / 300W |     18MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   8  NVIDIA A800 80GB PCIe          Off | 00000000:C1:00.0 Off |                    0 |
| N/A   33C    P0              44W / 300W |     18MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   9  NVIDIA A800 80GB PCIe          Off | 00000000:C2:00.0 Off |                    0 |
| N/A   34C    P0              44W / 300W |     18MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

Run logs or screenshots

# Run log

root@d91c734b9499:/opt/app# bash /data/agi/Chinese-LLaMA-Alpaca/scripts/training/run_pt.sh
[2023-07-07 09:29:01,360] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-07-07 09:29:01,364] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-07-07 09:29:01,365] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-07-07 09:29:01,365] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-07-07 09:29:01,398] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-07-07 09:29:01,414] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-07-07 09:29:01,417] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-07-07 09:29:03,691] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-07-07 09:29:03,691] [INFO] [comm.py:594:init_distributed] cdb=None
[2023-07-07 09:29:03,785] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-07-07 09:29:03,785] [INFO] [comm.py:594:init_distributed] cdb=None
[2023-07-07 09:29:03,853] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-07-07 09:29:03,854] [INFO] [comm.py:594:init_distributed] cdb=None
[2023-07-07 09:29:03,862] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-07-07 09:29:03,862] [INFO] [comm.py:594:init_distributed] cdb=None
[2023-07-07 09:29:03,864] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-07-07 09:29:03,864] [INFO] [comm.py:594:init_distributed] cdb=None
[2023-07-07 09:29:03,864] [INFO] [comm.py:625:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-07-07 09:29:03,867] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-07-07 09:29:03,867] [INFO] [comm.py:594:init_distributed] cdb=None
[2023-07-07 09:29:03,871] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-07-07 09:29:03,871] [INFO] [comm.py:594:init_distributed] cdb=None
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
07/07/2023 09:29:04 - WARNING - __main__ - Process rank: 5, device: cuda:5, n_gpu: 1distributed training: True, 16-bits training: True
07/07/2023 09:29:04 - WARNING - __main__ - Process rank: 1, device: cuda:1, n_gpu: 1distributed training: True, 16-bits training: True
07/07/2023 09:29:04 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: True
[INFO|configuration_utils.py:667] 2023-07-07 09:29:04,229 >> loading configuration file /data/agi/LoRA/hf/7B_Llama_Plus/config.json
[INFO|configuration_utils.py:725] 2023-07-07 09:29:04,229 >> Model config LlamaConfig {
  "_name_or_path": "/data/agi/LoRA/hf/7B_Llama_Plus",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 11008,
  "max_position_embeddings": 2048,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "pad_token_id": 0,
  "rms_norm_eps": 1e-06,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.30.2",
  "use_cache": true,
  "vocab_size": 49953
}

[INFO|tokenization_utils_base.py:1821] 2023-07-07 09:29:04,230 >> loading file tokenizer.model
[INFO|tokenization_utils_base.py:1821] 2023-07-07 09:29:04,230 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:1821] 2023-07-07 09:29:04,230 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:1821] 2023-07-07 09:29:04,230 >> loading file tokenizer_config.json
07/07/2023 09:29:04 - WARNING - __main__ - Process rank: 4, device: cuda:4, n_gpu: 1distributed training: True, 16-bits training: True
07/07/2023 09:29:04 - WARNING - __main__ - Process rank: 3, device: cuda:3, n_gpu: 1distributed training: True, 16-bits training: True
07/07/2023 09:29:04 - WARNING - __main__ - Process rank: 6, device: cuda:6, n_gpu: 1distributed training: True, 16-bits training: True
07/07/2023 09:29:05 - WARNING - __main__ - Process rank: 2, device: cuda:2, n_gpu: 1distributed training: True, 16-bits training: True
07/07/2023 09:29:05 - INFO - datasets.builder - Using custom data configuration default-4e021b6fe6b72b11
07/07/2023 09:29:05 - INFO - datasets.info - Loading Dataset Infos from /opt/app/envs/venv_peft_13e53fc/lib/python3.10/site-packages/datasets/packaged_modules/text
07/07/2023 09:29:05 - INFO - datasets.builder - Generating dataset text (/data/agi/nh/cache/txt/knowledge_text/text/default-4e021b6fe6b72b11/0.0.0/cb1e9bd71a82ad27976be3b12b407850fe2837d80c22c5e03a28949843a8ace2)
Downloading and preparing dataset text/default to /data/agi/nh/cache/txt/knowledge_text/text/default-4e021b6fe6b72b11/0.0.0/cb1e9bd71a82ad27976be3b12b407850fe2837d80c22c5e03a28949843a8ace2...
Downloading data files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 11275.01it/s]
07/07/2023 09:29:05 - INFO - datasets.download.download_manager - Downloading took 0.0 min
07/07/2023 09:29:05 - INFO - datasets.download.download_manager - Checksum Computation took 0.0 min
Extracting data files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1593.58it/s]
07/07/2023 09:29:05 - INFO - datasets.builder - Generating train split
07/07/2023 09:29:05 - INFO - datasets.utils.info_utils - Unable to verify splits sizes.
Dataset text downloaded and prepared to /data/agi/nh/cache/txt/knowledge_text/text/default-4e021b6fe6b72b11/0.0.0/cb1e9bd71a82ad27976be3b12b407850fe2837d80c22c5e03a28949843a8ace2. Subsequent calls will reuse this data.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 848.53it/s]
07/07/2023 09:29:05 - INFO - __main__ - knowledge.txt has been loaded
07/07/2023 09:29:05 - INFO - datasets.arrow_dataset - Process #0 will write at /data/agi/nh/cache/txt/knowledge_text/tokenized_00000_of_00008.arrow
07/07/2023 09:29:05 - INFO - datasets.arrow_dataset - Process #1 will write at /data/agi/nh/cache/txt/knowledge_text/tokenized_00001_of_00008.arrow
07/07/2023 09:29:05 - INFO - datasets.arrow_dataset - Process #2 will write at /data/agi/nh/cache/txt/knowledge_text/tokenized_00002_of_00008.arrow
07/07/2023 09:29:05 - INFO - datasets.arrow_dataset - Process #3 will write at /data/agi/nh/cache/txt/knowledge_text/tokenized_00003_of_00008.arrow
07/07/2023 09:29:05 - INFO - datasets.arrow_dataset - Process #4 will write at /data/agi/nh/cache/txt/knowledge_text/tokenized_00004_of_00008.arrow
07/07/2023 09:29:05 - INFO - datasets.arrow_dataset - Process #5 will write at /data/agi/nh/cache/txt/knowledge_text/tokenized_00005_of_00008.arrow
07/07/2023 09:29:05 - INFO - datasets.arrow_dataset - Process #6 will write at /data/agi/nh/cache/txt/knowledge_text/tokenized_00006_of_00008.arrow
07/07/2023 09:29:05 - INFO - datasets.arrow_dataset - Process #7 will write at /data/agi/nh/cache/txt/knowledge_text/tokenized_00007_of_00008.arrow
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1337 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -7) local_rank: 0 (pid: 1336) of binary: /opt/app/envs/venv_peft_13e53fc/bin/python
Traceback (most recent call last):
  File "/opt/app/envs/venv_peft_13e53fc/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/opt/app/envs/venv_peft_13e53fc/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/app/envs/venv_peft_13e53fc/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/opt/app/envs/venv_peft_13e53fc/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/opt/app/envs/venv_peft_13e53fc/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/app/envs/venv_peft_13e53fc/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/data/agi/Chinese-LLaMA-Alpaca/scripts/training/run_clm_pt_with_peft.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-07-07_09:29:19
  host      : d91c734b9499
  rank      : 2 (local_rank: 2)
  exitcode  : -7 (pid: 1338)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 1338
[2]:
  time      : 2023-07-07_09:29:19
  host      : d91c734b9499
  rank      : 3 (local_rank: 3)
  exitcode  : -7 (pid: 1339)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 1339
[3]:
  time      : 2023-07-07_09:29:19
  host      : d91c734b9499
  rank      : 4 (local_rank: 4)
  exitcode  : -7 (pid: 1340)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 1340
[4]:
  time      : 2023-07-07_09:29:19
  host      : d91c734b9499
  rank      : 5 (local_rank: 5)
  exitcode  : -7 (pid: 1341)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 1341
[5]:
  time      : 2023-07-07_09:29:19
  host      : d91c734b9499
  rank      : 6 (local_rank: 6)
  exitcode  : -7 (pid: 1342)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 1342
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-07-07_09:29:19
  host      : d91c734b9499
  rank      : 0 (local_rank: 0)
  exitcode  : -7 (pid: 1336)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 1336
============================================================
iMountTai commented 1 year ago

How large is your data?

shibingli commented 1 year ago

> How large is your data?

Just over 30,000 entries, around 400K. Maybe the data volume is a bit too small?

iMountTai commented 1 year ago

I had assumed your dataset was large; the "closing signal SIGTERM" cases I have run into before were all memory problems. Also, once the data cache has been generated, there is no need to regenerate it just because the GPU count changed. For the rest, please keep debugging on your end.
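
A hedged sketch of what the cache advice above means for run_pt.sh: the two rm -Rf lines wipe the cache before every run, which forces the tokenized dataset to be regenerated each time; leaving the cache directory intact lets subsequent runs, with any GPU count, reuse it.

# In run_pt.sh, keep the tokenized cache between runs:
# rm -Rf /data/agi/nh/cache/txt/*    # skip this; the cache does not depend on GPU count
rm -Rf /data/agi/nh/model/txt/*      # clearing only the output dir is harmless here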

shibingli commented 1 year ago

@iMountTai OK, I'll give it another try.

shibingli commented 1 year ago

Resolved. The root cause was indeed the data volume.