Closed vv521 closed 8 months ago
For training or inference, just add the "--bits 4" argument to the script.
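For illustration, a minimal sketch of what that change could look like in a launch script. The surrounding flags, paths, and port number here are placeholders, not copied from the repo's actual ft_scripts/fine_continue.bash:

```shell
# Hypothetical excerpt of a fine-tuning launch script; the only change
# needed for 4-bit (QLoRA-style) loading is the added --bits 4 flag.
CUDA_VISIBLE_DEVICES="0" torchrun --nproc_per_node=1 --master_port=1287 \
    src/finetune.py \
    --model_name_or_path models/Baichuan2-13B-Chat \
    --do_train \
    --bits 4
```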
Got it. Is the bitsandbytes library Linux-only? It keeps throwing errors on Windows.
bitsandbytes may have poor compatibility on Windows; we run it on Linux.
OK. Running fine_continue.bash on Linux produces the following error:
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run
python -m bitsandbytes
bin /root/miniconda3/envs/IEPile/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda118.so
/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /root/miniconda3/envs/IEPile did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
warn(msg)
/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/nvidia/lib'), PosixPath('/usr/local/nvidia/lib64')}
warn(msg)
/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /usr/local/nvidia/lib:/usr/local/nvidia/lib64 did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
warn(msg)
/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('8888/jupyter'), PosixPath('http'), PosixPath('//autodl-container-2a5049addd-221c2cd9')}
warn(msg)
/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('Asia/Shanghai')}
warn(msg)
/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('8443'), PosixPath('https'), PosixPath('//u206495-addd-221c2cd9.bjb1.seetacloud.com')}
warn(msg)
/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/tmp/torchelastic_qmtkanw4/none_m5rfu1ql/attempt_0/3/error.json')}
warn(msg)
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths...
/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: Found duplicate ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] files: {PosixPath('/usr/local/cuda/lib64/libcudart.so.11.0'), PosixPath('/usr/local/cuda/lib64/libcudart.so')}.. We'll flip a coin and try one of these, in order to fail forward.
Either way, this might cause trouble in the future:
If you get CUDA error: invalid device function
errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.
warn(msg)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so.11.0
CUDA SETUP: Highest compute capability among GPUs detected: 8.9
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /root/miniconda3/envs/IEPile/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda118.so...
(The same BUG REPORT banner and CUDA SETUP warnings are printed again by each of the other three torchrun ranks; repeats omitted.)
03/01/2024 21:10:31 - WARNING - args.parser - ddp_find_unused_parameters needs to be set as False for LoRA in DDP training.
[INFO|training_args.py:1299] 2024-03-01 21:10:31,973 >> Found safetensors installation, but --save_safetensors=False. Safetensors should be a preferred weights saving format due to security and performance reasons. If your model cannot be saved by safetensors please feel free to open an issue at https://github.com/huggingface/safetensors!
[INFO|training_args.py:1713] 2024-03-01 21:10:31,973 >> PyTorch: setting up devices
/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/transformers/training_args.py:1617: FutureWarning: --push_to_hub_token is deprecated and will be removed in version 5 of 🤗 Transformers. Use --hub_token instead.
warnings.warn(
03/01/2024 21:10:31 - INFO - args.parser - Process rank: 0, device: cuda:0, n_gpu: 1
distributed training: True, compute dtype: torch.bfloat16
03/01/2024 21:10:31 - INFO - args.parser - Training/evaluation parameters TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=True,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=False,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=epoch,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'fsdp_min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
generation_config=None,
generation_max_length=None,
generation_num_beams=None,
gradient_accumulation_steps=4,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=True,
half_precision_backend=auto,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=
[INFO|tokenization_auto.py:512] 2024-03-01 21:10:31,975 >> Could not locate the tokenizer configuration file, will try to use the model config instead.
(Tracebacks from the four torchrun ranks are interleaved here; frames from another rank's tokenizer/config loading, e.g. tokenizer_class.from_pretrained and AutoConfig.from_pretrained, are mixed in and omitted. The traceback of the fatal error, untangled:)
Traceback (most recent call last):
  File "/root/autodl-tmp/IEPile/src/finetune.py", line 115, in <module>
  File "/root/autodl-tmp/IEPile/src/finetune.py", line 97, in main
    model_args, data_args, training_args, finetuning_args, generating_args = get_train_args(args)
  File "/root/autodl-tmp/IEPile/src/args/parser.py", line 65, in get_train_args
    model_args, data_args, training_args, finetuning_args, generating_args = parse_train_args(args)
  File "/root/autodl-tmp/IEPile/src/args/parser.py", line 56, in parse_train_args
    return parse_args(parser, args)
  File "/root/autodl-tmp/IEPile/src/args/parser.py", line 42, in parse_args
    return parser.parse_args_into_dataclasses()
  File "/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/transformers/hf_argparser.py", line 338, in parse_args_into_dataclasses
    obj = dtype(inputs)
  File "/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/transformers/training_args.py", line 1372, in __post_init__
    and (self.device.type != "cuda")
  File "/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/transformers/training_args.py", line 1795, in device
    return self._setup_devices
  File "/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/transformers/utils/generic.py", line 54, in __get__
    cached = self.fget(obj)
  File "/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/transformers/training_args.py", line 1739, in _setup_devices
    self.distributed_state = PartialState(
  File "/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/accelerate/state.py", line 198, in __init__
    torch.cuda.set_device(self.device)
  File "/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/torch/cuda/__init__.py", line 350, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Root Cause (first observed failure):
[0]:
  time       : 2024-03-01_21:10:34
  host       : autodl-container-2a5049addd-221c2cd9
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 4229)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
How can I solve this?
The error says OSError: model/baichuan2-13b-iepile-lora does not appear to have a file named config.json. Checkout 'https://huggingface.co/model/baichuan2-13b-iepile-lora/main' for available files. Please make sure the path given to --model_name_or_path is the Baichuan2-13B-Chat base model, not the baichuan2-13b-iepile-lora LoRA weights. To fix this, download Baichuan2-13B-Chat from https://huggingface.co/baichuan-inc/Baichuan2-13B-Chat/tree/main, put it under IEPile's models directory, and put baichuan2-13b-iepile-lora under IEPile's lora directory.
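One way to fetch the weights into that layout is with huggingface-cli (from the huggingface_hub package). The target directory names follow the reply above; the LoRA repo id used here is an assumption, so substitute the actual one:

```shell
# Download the Baichuan2-13B-Chat base model into IEPile's models directory;
# config.json must end up at models/Baichuan2-13B-Chat/config.json.
huggingface-cli download baichuan-inc/Baichuan2-13B-Chat \
    --local-dir models/Baichuan2-13B-Chat

# Place the LoRA adapter under the lora directory (repo id assumed here).
huggingface-cli download zjunlp/baichuan2-13b-iepile-lora \
    --local-dir lora/baichuan2-13b-iepile-lora
```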
OK, I've downloaded the model, but I still get the following error:
(IEPile) root@autodl-container-59154cad17-09a5b617:~/autodl-tmp/IEPile# bash ft_scripts/fine_continue.bash
Traceback (most recent call last):
  File "/root/autodl-tmp/IEPile/src/finetune.py", line 115, in <module>
(the fragment "TORCH_USE_CUDA_DSA to enable device-side assertions." is interleaved into this frame from the other ranks; the same truncated traceback is printed three times, repeats omitted)
03/02/2024 15:58:43 - WARNING - args.parser - ddp_find_unused_parameters needs to be set as False for LoRA in DDP training.
[INFO|training_args.py:1299] 2024-03-02 15:58:43,375 >> Found safetensors installation, but --save_safetensors=False. Safetensors should be a preferred weights saving format due to security and performance reasons. If your model cannot be saved by safetensors please feel free to open an issue at https://github.com/huggingface/safetensors!
[INFO|training_args.py:1713] 2024-03-02 15:58:43,377 >> PyTorch: setting up devices
/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/transformers/training_args.py:1617: FutureWarning: --push_to_hub_token is deprecated and will be removed in version 5 of 🤗 Transformers. Use --hub_token instead.
warnings.warn(
03/02/2024 15:58:43 - INFO - args.parser - Process rank: 0, device: cuda:0, n_gpu: 1
distributed training: True, compute dtype: torch.bfloat16
03/02/2024 15:58:43 - INFO - args.parser - Training/evaluation parameters TrainingArguments(
(TrainingArguments dump identical to the one from the first run, again cut off at hub_token=; omitted.)
Root Cause (first observed failure):
[0]:
  time       : 2024-03-02_15:58:48
  host       : autodl-container-59154cad17-09a5b617
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 2713)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
How can I solve this?
RuntimeError: CUDA error: invalid device ordinal is probably a CUDA/GPU-count issue. Launch with CUDA_VISIBLE_DEVICES="0,1,2,3" torchrun --nproc_per_node=4 --master_port=1287 src/finetune.py, and set these to however many GPUs you actually have available.
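As a sanity check before choosing --nproc_per_node, here is a small sketch that counts the devices listed in CUDA_VISIBLE_DEVICES. The helper name visible_gpu_count is ours for illustration, not from IEPile:

```python
import os

def visible_gpu_count(env=None):
    """Count the devices listed in CUDA_VISIBLE_DEVICES.

    torchrun's --nproc_per_node must not exceed this number; otherwise
    some ranks are mapped to nonexistent devices and torch.cuda.set_device
    raises "CUDA error: invalid device ordinal".
    """
    env = os.environ if env is None else env
    value = env.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return None  # unset: all physical GPUs are visible
    return len([d for d in value.split(",") if d.strip()])

print(visible_gpu_count({"CUDA_VISIBLE_DEVICES": "0,1,2,3"}))  # 4
```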
Thanks, that solved it. But even with 4-bit quantization enabled, a 3090 runs out of CUDA memory. Would two 3090s be enough?
1. Lower these parameters: --max_source_length 400, --cutoff_len 700, --max_target_length 300. 2. Lower --per_device_train_batch_size 2, --per_device_eval_batch_size 2, --gradient_accumulation_steps 4. 3. For the baichuan2 model, if GPU memory blows up when saving after eval, set evaluation_strategy no.
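Applied to a launch script, those settings could look roughly like this. Only the flag values named in the reply are meaningful here; the launcher line and port are placeholders:

```shell
# Reduced-memory settings from the reply, shown among the launch arguments
# (other flags of the actual script unchanged and omitted here).
CUDA_VISIBLE_DEVICES="0,1" torchrun --nproc_per_node=2 --master_port=1287 \
    src/finetune.py \
    --max_source_length 400 \
    --cutoff_len 700 \
    --max_target_length 300 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --evaluation_strategy no
```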
OK, I've made those changes. After fine_continue.bash finishes loading the model, the run hits: Parameter 'function'=<function preprocess_dataset.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/root/autodl-tmp/IEPile/src/finetune.py", line 115, in
Look at the format of one record in your training file, and make sure it has 1) an instruction field and 2) an output field.
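A minimal sketch of that check, assuming the training file is JSON Lines (one JSON object per line). The two field names come from the reply above; everything else here is illustrative:

```python
import json

REQUIRED_FIELDS = ("instruction", "output")

def missing_fields(jsonl_line: str):
    """Return which required fields are absent from one training record."""
    record = json.loads(jsonl_line)
    return [field for field in REQUIRED_FIELDS if field not in record]

# A record with no "output" key is flagged:
bad = '{"instruction": "Extract all person entities.", "input": "..."}'
print(missing_fields(bad))  # ['output']
```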
Has your issue been resolved?