zjunlp / IEPile

[ACL 2024] IEPile: A Large-Scale Information Extraction Corpus
http://oneke.openkg.cn/

How do I apply 4-bit quantization to Baichuan2? What are the concrete steps? #1

Closed vv521 closed 6 months ago

guihonghao commented 6 months ago

For training or inference, just add the "--bits 4" argument to the script.
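For reference, a minimal, hedged sketch of what such a launch can look like with 4-bit quantization switched on. The flag names below all appear in the argument dumps quoted later in this thread; the concrete paths, port, and values are placeholders, not the repository's official script.

```bash
# Hedged sketch of a fine-tuning launch with 4-bit (QLoRA-style) quantization enabled.
# Flag names are taken from the argument dumps quoted later in this issue; paths are placeholders.
CUDA_VISIBLE_DEVICES="0" torchrun --nproc_per_node=1 --master_port=1287 src/finetune.py \
    --do_train --do_eval \
    --model_name baichuan \
    --model_name_or_path models/Baichuan2-13B-Chat \
    --template baichuan2 \
    --train_file data/kuangshan-re/train.json \
    --valid_file data/kuangshan-re/test.json \
    --output_dir lora/baichuan2-13b-ft \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --bits 4   # the only extra flag needed for 4-bit quantization
```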

vv521 commented 6 months ago

> For training or inference, just add the "--bits 4" argument to the script.

Got it. Is the bitsandbytes library only usable on Linux? It keeps throwing errors on Windows.

guihonghao commented 6 months ago

bitsandbytes may not be well supported on Windows; we run it on Linux.

vv521 commented 6 months ago

> bitsandbytes may not be well supported on Windows; we run it on Linux.

OK. Running fine_continue.bash on Linux produces the following error:

===================================BUG REPORT=================================== Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues

bin /root/miniconda3/envs/IEPile/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda118.so /root/miniconda3/envs/IEPile/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /root/miniconda3/envs/IEPile did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths... warn(msg) /root/miniconda3/envs/IEPile/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/nvidia/lib'), PosixPath('/usr/local/nvidia/lib64')} warn(msg) /root/miniconda3/envs/IEPile/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /usr/local/nvidia/lib:/usr/local/nvidia/lib64 did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths... warn(msg) /root/miniconda3/envs/IEPile/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('8888/jupyter'), PosixPath('http'), PosixPath('//autodl-container-2a5049addd-221c2cd9')} warn(msg) /root/miniconda3/envs/IEPile/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('Asia/Shanghai')} warn(msg) /root/miniconda3/envs/IEPile/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('8443'), PosixPath('https'), PosixPath('//u206495-addd-221c2cd9.bjb1.seetacloud.com')} warn(msg) /root/miniconda3/envs/IEPile/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/tmp/torchelastic_qmtkanw4/none_m5rfu1ql/attempt_0/3/error.json')} warn(msg) CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths... /root/miniconda3/envs/IEPile/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: Found duplicate ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] files: {PosixPath('/usr/local/cuda/lib64/libcudart.so.11.0'), PosixPath('/usr/local/cuda/lib64/libcudart.so')}.. We'll flip a coin and try one of these, in order to fail forward. Either way, this might cause trouble in the future: If you get CUDA error: invalid device function errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env. warn(msg) CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so.11.0 CUDA SETUP: Highest compute capability among GPUs detected: 8.9 CUDA SETUP: Detected CUDA version 118 CUDA SETUP: Loading binary /root/miniconda3/envs/IEPile/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda118.so...


bin /root/miniconda3/envs/IEPile/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda118.so bin /root/miniconda3/envs/IEPile/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda118.so /root/miniconda3/envs/IEPile/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /root/miniconda3/envs/IEPile did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths... warn(msg) /root/miniconda3/envs/IEPile/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/nvidia/lib64'), PosixPath('/usr/local/nvidia/lib')} warn(msg) /root/miniconda3/envs/IEPile/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /usr/local/nvidia/lib:/usr/local/nvidia/lib64 did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths... warn(msg) /root/miniconda3/envs/IEPile/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('8888/jupyter'), PosixPath('http'), PosixPath('//autodl-container-2a5049addd-221c2cd9')} warn(msg) /root/miniconda3/envs/IEPile/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('Asia/Shanghai')} warn(msg) /root/miniconda3/envs/IEPile/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('8443'), PosixPath('//u206495-addd-221c2cd9.bjb1.seetacloud.com'), PosixPath('https')} warn(msg) /root/miniconda3/envs/IEPile/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/tmp/torchelastic_qmtkanw4/none_m5rfu1ql/attempt_0/1/error.json')} warn(msg) CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths... /root/miniconda3/envs/IEPile/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: Found duplicate ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] files: {PosixPath('/usr/local/cuda/lib64/libcudart.so'), PosixPath('/usr/local/cuda/lib64/libcudart.so.11.0')}.. We'll flip a coin and try one of these, in order to fail forward. Either way, this might cause trouble in the future: If you get CUDA error: invalid device function errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env. warn(msg) CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so CUDA SETUP: Highest compute capability among GPUs detected: 8.9 CUDA SETUP: Detected CUDA version 118 CUDA SETUP: Loading binary /root/miniconda3/envs/IEPile/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda118.so... bin /root/miniconda3/envs/IEPile/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda118.so /root/miniconda3/envs/IEPile/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /root/miniconda3/envs/IEPile did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths... 
warn(msg) /root/miniconda3/envs/IEPile/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/nvidia/lib'), PosixPath('/usr/local/nvidia/lib64')} warn(msg) /root/miniconda3/envs/IEPile/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /usr/local/nvidia/lib:/usr/local/nvidia/lib64 did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths... warn(msg) /root/miniconda3/envs/IEPile/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('http'), PosixPath('8888/jupyter'), PosixPath('//autodl-container-2a5049addd-221c2cd9')} warn(msg) /root/miniconda3/envs/IEPile/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('Asia/Shanghai')} warn(msg) /root/miniconda3/envs/IEPile/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('https'), PosixPath('8443'), PosixPath('//u206495-addd-221c2cd9.bjb1.seetacloud.com')} warn(msg) /root/miniconda3/envs/IEPile/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/tmp/torchelastic_qmtkanw4/none_m5rfu1ql/attempt_0/2/error.json')} warn(msg) CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths... /root/miniconda3/envs/IEPile/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: Found duplicate ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] files: {PosixPath('/usr/local/cuda/lib64/libcudart.so'), PosixPath('/usr/local/cuda/lib64/libcudart.so.11.0')}.. We'll flip a coin and try one of these, in order to fail forward. Either way, this might cause trouble in the future: If you get CUDA error: invalid device function errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env. warn(msg) CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so CUDA SETUP: Highest compute capability among GPUs detected: 8.9 CUDA SETUP: Detected CUDA version 118 CUDA SETUP: Loading binary /root/miniconda3/envs/IEPile/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda118.so... /root/miniconda3/envs/IEPile/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /root/miniconda3/envs/IEPile did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths... warn(msg) /root/miniconda3/envs/IEPile/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/nvidia/lib64'), PosixPath('/usr/local/nvidia/lib')} warn(msg) /root/miniconda3/envs/IEPile/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /usr/local/nvidia/lib:/usr/local/nvidia/lib64 did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths... 
warn(msg) /root/miniconda3/envs/IEPile/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('8888/jupyter'), PosixPath('http'), PosixPath('//autodl-container-2a5049addd-221c2cd9')} warn(msg) /root/miniconda3/envs/IEPile/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('Asia/Shanghai')} warn(msg) /root/miniconda3/envs/IEPile/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('8443'), PosixPath('https'), PosixPath('//u206495-addd-221c2cd9.bjb1.seetacloud.com')} warn(msg) /root/miniconda3/envs/IEPile/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/tmp/torchelastic_qmtkanw4/none_m5rfu1ql/attempt_0/0/error.json')} warn(msg) CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths... /root/miniconda3/envs/IEPile/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: Found duplicate ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] files: {PosixPath('/usr/local/cuda/lib64/libcudart.so'), PosixPath('/usr/local/cuda/lib64/libcudart.so.11.0')}.. We'll flip a coin and try one of these, in order to fail forward. Either way, this might cause trouble in the future: If you get CUDA error: invalid device function errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env. warn(msg) CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so CUDA SETUP: Highest compute capability among GPUs detected: 8.9 CUDA SETUP: Detected CUDA version 118 CUDA SETUP: Loading binary /root/miniconda3/envs/IEPile/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda118.so... 03/01/2024 21:10:31 - WARNING - args.parser - ddp_find_unused_parameters needs to be set as False for LoRA in DDP training. [INFO|training_args.py:1299] 2024-03-01 21:10:31,973 >> Found safetensors installation, but --save_safetensors=False. Safetensors should be a preferred weights saving format due to security and performance reasons. If your model cannot be saved by safetensors please feel free to open an issue at https://github.com/huggingface/safetensors! [INFO|training_args.py:1713] 2024-03-01 21:10:31,973 >> PyTorch: setting up devices /root/miniconda3/envs/IEPile/lib/python3.9/site-packages/transformers/training_args.py:1617: FutureWarning: --push_to_hub_token is deprecated and will be removed in version 5 of 🤗 Transformers. Use --hub_token instead. 
warnings.warn( 03/01/2024 21:10:31 - INFO - args.parser - Process rank: 0, device: cuda:0, n_gpu: 1 distributed training: True, compute dtype: torch.bfloat16 03/01/2024 21:10:31 - INFO - args.parser - Training/evaluation parameters TrainingArguments( _n_gpu=1, adafactor=False, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, auto_find_batch_size=False, bf16=True, bf16_full_eval=False, data_seed=None, dataloader_drop_last=False, dataloader_num_workers=0, dataloader_pin_memory=True, ddp_backend=None, ddp_broadcast_buffers=None, ddp_bucket_cap_mb=None, ddp_find_unused_parameters=False, ddp_timeout=1800, debug=[], deepspeed=None, disable_tqdm=False, do_eval=True, do_predict=False, do_train=True, eval_accumulation_steps=None, eval_delay=0, eval_steps=None, evaluation_strategy=epoch, fp16=False, fp16_backend=auto, fp16_full_eval=False, fp16_opt_level=O1, fsdp=[], fsdp_config={'fsdp_min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False}, fsdp_min_num_params=0, fsdp_transformer_layer_cls_to_wrap=None, full_determinism=False, generation_config=None, generation_max_length=None, generation_num_beams=None, gradient_accumulation_steps=4, gradient_checkpointing=False, greater_is_better=None, group_by_length=True, half_precision_backend=auto, hub_model_id=None, hub_private_repo=False, hub_strategy=every_save, hub_token=, ignore_data_skip=False, include_inputs_for_metrics=False, jit_mode_eval=False, label_names=None, label_smoothing_factor=0.0, learning_rate=5e-05, length_column_name=length, load_best_model_at_end=False, local_rank=0, log_level=passive, log_level_replica=warning, log_on_each_node=True, logging_dir=lora/baichuan2-13b-iepile-contiune-v1/runs/Mar01_21-10-31_autodl-container-2a5049addd-221c2cd9, logging_first_step=False, logging_nan_inf_filter=True, logging_steps=2, logging_strategy=steps, loss_scale=1.0, lr_scheduler_type=linear, max_grad_norm=0.5, max_steps=-1, metric_for_best_model=None, mp_parameters=, no_cuda=False, num_train_epochs=10.0, optim=adamw_torch, optim_args=None, output_dir=lora/baichuan2-13b-iepile-contiune-v1, overwrite_output_dir=True, past_index=-1, per_device_eval_batch_size=2, per_device_train_batch_size=2, predict_with_generate=False, prediction_loss_only=False, push_to_hub=False, push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=, ray_scope=last, remove_unused_columns=False, report_to=[], resume_from_checkpoint=None, run_name=lora/baichuan2-13b-iepile-contiune-v1, save_on_each_node=False, save_safetensors=False, save_steps=500, save_strategy=epoch, save_total_limit=10, seed=42, sharded_ddp=[], skip_memory_metrics=True, sortish_sampler=False, tf32=None, torch_compile=False, torch_compile_backend=None, torch_compile_mode=None, torchdynamo=None, tpu_metrics_debug=False, tpu_num_cores=None, use_ipex=False, use_legacy_prediction_loop=False, use_mps_device=False, warmup_ratio=0.0, warmup_steps=0, weight_decay=0.0, xpu_backend=None, ) 03/01/2024 21:10:31 - INFO - main - Start Time: 2024:03:01 21:10:31 03/01/2024 21:10:31 - INFO - main - model_args:ModelArguments(model_name_or_path='model/baichuan2-13b-iepile-lora', model_name='baichuan', cache_dir=None, use_fast_tokenizer=True, trust_remote_code=True, use_auth_token=False, model_revision='main', split_special_tokens=False, bits=4, adam8bit=False, double_quant=True, quant_type='nf4', checkpoint_dir=['model/baichuan2-13b-iepile-lora']) data_args:DataArguments(train_file='data/kuangshan-re/train.json', valid_file='data/kuangshan-re/test.json', predict_file=None, 
preprocessing_num_workers=16, overwrite_cache=False, cache_path=None, template='baichuan2', system_prompt=None, max_source_length=400, max_target_length=300, cutoff_len=700, val_set_size=1000, pad_to_max_length=False, ignore_pad_token_for_loss=True, train_on_prompt=False, language='zh', id_text='input') training_args:TrainingArguments( _n_gpu=1, adafactor=False, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, auto_find_batch_size=False, bf16=True, bf16_full_eval=False, data_seed=None, dataloader_drop_last=False, dataloader_num_workers=0, dataloader_pin_memory=True, ddp_backend=None, ddp_broadcast_buffers=None, ddp_bucket_cap_mb=None, ddp_find_unused_parameters=False, ddp_timeout=1800, debug=[], deepspeed=None, disable_tqdm=False, do_eval=True, do_predict=False, do_train=True, eval_accumulation_steps=None, eval_delay=0, eval_steps=None, evaluation_strategy=epoch, fp16=False, fp16_backend=auto, fp16_full_eval=False, fp16_opt_level=O1, fsdp=[], fsdp_config={'fsdp_min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False}, fsdp_min_num_params=0, fsdp_transformer_layer_cls_to_wrap=None, full_determinism=False, generation_config=None, generation_max_length=None, generation_num_beams=None, gradient_accumulation_steps=4, gradient_checkpointing=False, greater_is_better=None, group_by_length=True, half_precision_backend=auto, hub_model_id=None, hub_private_repo=False, hub_strategy=every_save, hub_token=, ignore_data_skip=False, include_inputs_for_metrics=False, jit_mode_eval=False, label_names=None, label_smoothing_factor=0.0, learning_rate=5e-05, length_column_name=length, load_best_model_at_end=False, local_rank=0, log_level=passive, log_level_replica=warning, log_on_each_node=True, logging_dir=lora/baichuan2-13b-iepile-contiune-v1/runs/Mar01_21-10-31_autodl-container-2a5049addd-221c2cd9, logging_first_step=False, logging_nan_inf_filter=True, logging_steps=2, logging_strategy=steps, loss_scale=1.0, lr_scheduler_type=linear, max_grad_norm=0.5, max_steps=-1, metric_for_best_model=None, mp_parameters=, no_cuda=False, num_train_epochs=10.0, optim=adamw_torch, optim_args=None, output_dir=lora/baichuan2-13b-iepile-contiune-v1, overwrite_output_dir=True, past_index=-1, per_device_eval_batch_size=2, per_device_train_batch_size=2, predict_with_generate=False, prediction_loss_only=False, push_to_hub=False, push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=, ray_scope=last, remove_unused_columns=False, report_to=[], resume_from_checkpoint=None, run_name=lora/baichuan2-13b-iepile-contiune-v1, save_on_each_node=False, save_safetensors=False, save_steps=500, save_strategy=epoch, save_total_limit=10, seed=42, sharded_ddp=[], skip_memory_metrics=True, sortish_sampler=False, tf32=None, torch_compile=False, torch_compile_backend=None, torch_compile_mode=None, torchdynamo=None, tpu_metrics_debug=False, tpu_num_cores=None, use_ipex=False, use_legacy_prediction_loop=False, use_mps_device=False, warmup_ratio=0.0, warmup_steps=0, weight_decay=0.0, xpu_backend=None, ) finetuning_args:FinetuningArguments(dpo_beta=0.1, ppo_logger=None, ppo_score_norm=False, ppo_target=6.0, ppo_whiten_rewards=False, ref_model=None, ref_model_checkpoint=None, ref_model_quantization_bit=None, reward_model=None, reward_model_checkpoint=None, reward_model_quantization_bit=None, reward_model_type='lora', lora_r=64, lora_alpha=64.0, lora_dropout=0.05, lora_target_modules=['W_pack', 'o_proj', 'gate_proj', 'down_proj', 'up_proj'], additional_target=None, resume_lora_training=True, num_layer_trainable=3, 
name_module_trainable=['mlp'], stage='sft', finetuning_type='lora', upcast_layernorm=False, neft_alpha=0, export_dir=None, plot_loss=False) generating_args:GenerationArguments(max_length=512, max_new_tokens=256, min_new_tokens=None, do_sample=False, num_beams=1, num_beam_groups=1, penalty_alpha=None, use_cache=True, temperature=1.0, top_k=50, top_p=1.0, typical_p=1.0, diversity_penalty=0.0, repetition_penalty=1.0, length_penalty=1.0, no_repeat_ngram_size=0) 03/01/2024 21:10:31 - INFO - main - model_class:<class 'transformers.models.auto.modeling_auto.AutoModelForCausalLM'> tokenizer_class:<class 'transformers.models.auto.tokenization_auto.AutoTokenizer'> trainer_class:<class 'transformers.trainer.Trainer'>

[INFO|tokenization_auto.py:512] 2024-03-01 21:10:31,975 >> Could not locate the tokenizer configuration file, will try to use the model config instead. Traceback (most recent call last): File "/root/autodl-tmp/IEPile/src/finetune.py", line 115, in main() File "/root/autodl-tmp/IEPile/src/finetune.py", line 110, in main train(model_args, data_args, training_args, finetuning_args, generating_args) File "/root/autodl-tmp/IEPile/src/finetune.py", line 37, in train Traceback (most recent call last): File "/root/autodl-tmp/IEPile/src/finetune.py", line 115, in model, tokenizer = load_model_and_tokenizer( main() File "/root/autodl-tmp/IEPile/src/model/loader.py", line 53, in load_model_and_tokenizer

File "/root/autodl-tmp/IEPile/src/finetune.py", line 97, in main model_args, data_args, training_args, finetuning_args, generating_args = get_train_args(args) File "/root/autodl-tmp/IEPile/src/args/parser.py", line 65, in get_train_args tokenizer = tokenizer_class.from_pretrained( File "/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/transformers/models/auto/tokenization_auto.py", line 667, in from_pretrained model_args, data_args, training_args, finetuning_args, generating_args = parse_train_args(args) File "/root/autodl-tmp/IEPile/src/args/parser.py", line 56, in parse_train_args return parse_args(parser, args) File "/root/autodl-tmp/IEPile/src/args/parser.py", line 42, in parse_args return parser.parse_args_into_dataclasses() File "/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/transformers/hf_argparser.py", line 338, in parse_args_into_dataclasses config = AutoConfig.from_pretrained( File "/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/transformers/models/auto/configuration_auto.py", line 983, in from_pretrained obj = dtype(inputs) File "", line 118, in init File "/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/transformers/training_args.py", line 1372, in __post_init__ config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, kwargs) File "/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/transformers/configuration_utils.py", line 617, in get_config_dict config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, kwargs) File "/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/transformers/configuration_utils.py", line 672, in _get_config_dict and (self.device.type != "cuda") File "/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/transformers/training_args.py", line 1795, in device resolved_config_file = cached_file( File "/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/transformers/utils/hub.py", line 388, in cached_file raise EnvironmentError( OSError: model/baichuan2-13b-iepile-lora does not appear to have a file named config.json. Checkout 'https://huggingface.co/model/baichuan2-13b-iepile-lora/main' for available files. return self._setup_devices File "/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/transformers/utils/generic.py", line 54, in get cached = self.fget(obj) File "/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/transformers/training_args.py", line 1739, in _setup_devices Traceback (most recent call last): File "/root/autodl-tmp/IEPile/src/finetune.py", line 115, in main() File "/root/autodl-tmp/IEPile/src/finetune.py", line 97, in main model_args, data_args, training_args, finetuning_args, generating_args = get_train_args(args) File "/root/autodl-tmp/IEPile/src/args/parser.py", line 65, in get_train_args model_args, data_args, training_args, finetuning_args, generating_args = parse_train_args(args) File "/root/autodl-tmp/IEPile/src/args/parser.py", line 56, in parse_train_args return parse_args(parser, args) File "/root/autodl-tmp/IEPile/src/args/parser.py", line 42, in parse_args return parser.parse_args_into_dataclasses() File "/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/transformers/hf_argparser.py", line 338, in parse_args_into_dataclasses obj = dtype(inputs)self.distributed_state = PartialState(

File "", line 118, in init File "/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/accelerate/state.py", line 198, in init File "/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/transformers/training_args.py", line 1372, in post_init__ torch.cuda.set_device(self.device) File "/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/torch/cuda/init__.py", line 350, in set_device torch._C._cuda_setDevice(device) RuntimeError: CUDA error: invalid device ordinal CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1. Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

and (self.device.type != "cuda")

File "/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/transformers/training_args.py", line 1795, in device return self._setup_devices File "/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/transformers/utils/generic.py", line 54, in get cached = self.fget(obj) File "/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/transformers/training_args.py", line 1739, in _setup_devices Traceback (most recent call last): File "/root/autodl-tmp/IEPile/src/finetune.py", line 115, in main() File "/root/autodl-tmp/IEPile/src/finetune.py", line 97, in main model_args, data_args, training_args, finetuning_args, generating_args = get_train_args(args) File "/root/autodl-tmp/IEPile/src/args/parser.py", line 65, in get_train_args model_args, data_args, training_args, finetuning_args, generating_args = parse_train_args(args) File "/root/autodl-tmp/IEPile/src/args/parser.py", line 56, in parse_train_args self.distributed_state = PartialState( File "/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/accelerate/state.py", line 198, in init return parse_args(parser, args) File "/root/autodl-tmp/IEPile/src/args/parser.py", line 42, in parse_args return parser.parse_args_into_dataclasses() File "/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/transformers/hf_argparser.py", line 338, in parse_args_into_dataclasses torch.cuda.set_device(self.device) File "/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/torch/cuda/init.py", line 350, in set_device obj = dtype(**inputs) File "", line 118, in init torch._C._cuda_setDevice(device) RuntimeError: CUDA error: invalid device ordinal CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1. Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

File "/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/transformers/training_args.py", line 1372, in post_init__ and (self.device.type != "cuda") File "/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/transformers/training_args.py", line 1795, in device return self._setup_devices File "/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/transformers/utils/generic.py", line 54, in get cached = self.fget(obj) File "/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/transformers/training_args.py", line 1739, in _setup_devices self.distributed_state = PartialState( File "/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/accelerate/state.py", line 198, in init__ torch.cuda.set_device(self.device) File "/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/torch/cuda/init.py", line 350, in set_device torch._C._cuda_setDevice(device) RuntimeError: CUDA error: invalid device ordinal CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1. Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 4229) of binary: /root/miniconda3/envs/IEPile/bin/python Traceback (most recent call last): File "/root/miniconda3/envs/IEPile/bin/torchrun", line 8, in sys.exit(main()) File "/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper return f(*args, **kwargs) File "/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/torch/distributed/run.py", line 794, in main run(args) File "/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run elastic_launch( File "/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in call return launch_agent(self._config, self._entrypoint, list(args)) File "/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

src/finetune.py FAILED

Failures: [1]: time : 2024-03-01_21:10:34 host : autodl-container-2a5049addd-221c2cd9 rank : 1 (local_rank: 1) exitcode : 1 (pid: 4230) error_file: <N/A> traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [2]: time : 2024-03-01_21:10:34 host : autodl-container-2a5049addd-221c2cd9 rank : 2 (local_rank: 2) exitcode : 1 (pid: 4231) error_file: <N/A> traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [3]: time : 2024-03-01_21:10:34 host : autodl-container-2a5049addd-221c2cd9 rank : 3 (local_rank: 3) exitcode : 1 (pid: 4232) error_file: <N/A> traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
time : 2024-03-01_21:10:34
host : autodl-container-2a5049addd-221c2cd9
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 4229)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

How can I resolve this?

guihonghao commented 6 months ago

The error says: OSError: model/baichuan2-13b-iepile-lora does not appear to have a file named config.json. Checkout 'https://huggingface.co/model/baichuan2-13b-iepile-lora/main' for available files. Please make sure that the path given to --model_name_or_path is the path of the Baichuan2-13B-Chat base model, not the path of the baichuan2-13b-iepile-lora weights. To do that, download the Baichuan2-13B-Chat model from https://huggingface.co/baichuan-inc/Baichuan2-13B-Chat/tree/main, place it under IEPile's models directory, and place baichuan2-13b-iepile-lora under IEPile's lora directory.
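A hedged sketch of that layout. The download commands and the Hugging Face repo id used for the LoRA weights are assumptions based on this thread, not verified instructions; adjust them to wherever you actually obtained the weights.

```bash
# Sketch only: fetch the base model and the LoRA weights into the directories described above.
# Requires the Hugging Face CLI (pip install -U "huggingface_hub[cli]"); repo ids are assumptions.
huggingface-cli download baichuan-inc/Baichuan2-13B-Chat --local-dir models/Baichuan2-13B-Chat
huggingface-cli download zjunlp/baichuan2-13b-iepile-lora --local-dir lora/baichuan2-13b-iepile-lora

# In the training script, point --model_name_or_path at the *base* model and
# --checkpoint_dir at the LoRA weights (both flags appear in the argument dumps below):
#   --model_name_or_path models/Baichuan2-13B-Chat
#   --checkpoint_dir lora/baichuan2-13b-iepile-lora
```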

vv521 commented 6 months ago

> The error says: OSError: model/baichuan2-13b-iepile-lora does not appear to have a file named config.json. Checkout 'https://huggingface.co/model/baichuan2-13b-iepile-lora/main' for available files. Please make sure that the path given to --model_name_or_path is the path of the Baichuan2-13B-Chat base model, not the path of the baichuan2-13b-iepile-lora weights. To do that, download the Baichuan2-13B-Chat model from https://huggingface.co/baichuan-inc/Baichuan2-13B-Chat/tree/main, place it under IEPile's models directory, and place baichuan2-13b-iepile-lora under IEPile's lora directory.

OK, I have downloaded the model, but I still get the following error:

(IEPile) root@autodl-container-59154cad17-09a5b617:~/autodl-tmp/IEPile# bash ft_scripts/fine_continue.bash
Traceback (most recent call last):
  File "/root/autodl-tmp/IEPile/src/finetune.py", line 115, in <module>
    main()
  File "/root/autodl-tmp/IEPile/src/finetune.py", line 97, in main
    model_args, data_args, training_args, finetuning_args, generating_args = get_train_args(args)
  File "/root/autodl-tmp/IEPile/src/args/parser.py", line 65, in get_train_args
    model_args, data_args, training_args, finetuning_args, generating_args = parse_train_args(args)
  File "/root/autodl-tmp/IEPile/src/args/parser.py", line 56, in parse_train_args
    return parse_args(parser, args)
  File "/root/autodl-tmp/IEPile/src/args/parser.py", line 42, in parse_args
    return parser.parse_args_into_dataclasses()
  File "/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/transformers/hf_argparser.py", line 338, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 118, in __init__
  File "/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/transformers/training_args.py", line 1372, in __post_init__
    and (self.device.type != "cuda")
  File "/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/transformers/training_args.py", line 1795, in device
    return self._setup_devices
  File "/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/transformers/utils/generic.py", line 54, in __get__
    cached = self.fget(obj)
  File "/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/transformers/training_args.py", line 1739, in _setup_devices
    self.distributed_state = PartialState(
  File "/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/accelerate/state.py", line 198, in __init__
    torch.cuda.set_device(self.device)
  File "/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/torch/cuda/__init__.py", line 350, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

(The same traceback is printed by each of the remaining DDP ranks.)

03/02/2024 15:58:43 - WARNING - args.parser - ddp_find_unused_parameters needs to be set as False for LoRA in DDP training. [INFO|training_args.py:1299] 2024-03-02 15:58:43,375 >> Found safetensors installation, but --save_safetensors=False. Safetensors should be a preferred weights saving format due to security and performance reasons. If your model cannot be saved by safetensors please feel free to open an issue at https://github.com/huggingface/safetensors! [INFO|training_args.py:1713] 2024-03-02 15:58:43,377 >> PyTorch: setting up devices /root/miniconda3/envs/IEPile/lib/python3.9/site-packages/transformers/training_args.py:1617: FutureWarning: --push_to_hub_token is deprecated and will be removed in version 5 of 🤗 Transformers. Use --hub_token instead. warnings.warn( 03/02/2024 15:58:43 - INFO - args.parser - Process rank: 0, device: cuda:0, n_gpu: 1 distributed training: True, compute dtype: torch.bfloat16 03/02/2024 15:58:43 - INFO - args.parser - Training/evaluation parameters TrainingArguments( _n_gpu=1, adafactor=False, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, auto_find_batch_size=False, bf16=True, bf16_full_eval=False, data_seed=None, dataloader_drop_last=False, dataloader_num_workers=0, dataloader_pin_memory=True, ddp_backend=None, ddp_broadcast_buffers=None, ddp_bucket_cap_mb=None, ddp_find_unused_parameters=False, ddp_timeout=1800, debug=[], deepspeed=None, disable_tqdm=False, do_eval=True, do_predict=False, do_train=True, eval_accumulation_steps=None, eval_delay=0, eval_steps=None, evaluation_strategy=epoch, fp16=False, fp16_backend=auto, fp16_full_eval=False, fp16_opt_level=O1, fsdp=[], fsdp_config={'fsdp_min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False}, fsdp_min_num_params=0, fsdp_transformer_layer_cls_to_wrap=None, full_determinism=False, generation_config=None, generation_max_length=None, generation_num_beams=None, gradient_accumulation_steps=4, gradient_checkpointing=False, greater_is_better=None, group_by_length=True, half_precision_backend=auto, hub_model_id=None, hub_private_repo=False, hub_strategy=every_save, hub_token=, ignore_data_skip=False, include_inputs_for_metrics=False, jit_mode_eval=False, label_names=None, label_smoothing_factor=0.0, learning_rate=5e-05, length_column_name=length, load_best_model_at_end=False, local_rank=0, log_level=passive, log_level_replica=warning, log_on_each_node=True, logging_dir=lora/baichuan2-13b-chat-v1-continue/runs/Mar02_15-58-43_autodl-container-59154cad17-09a5b617, logging_first_step=False, logging_nan_inf_filter=True, logging_steps=2, logging_strategy=steps, loss_scale=1.0, lr_scheduler_type=linear, max_grad_norm=0.5, max_steps=-1, metric_for_best_model=None, mp_parameters=, no_cuda=False, num_train_epochs=10.0, optim=adamw_torch, optim_args=None, output_dir=lora/baichuan2-13b-chat-v1-continue, overwrite_output_dir=True, past_index=-1, per_device_eval_batch_size=2, per_device_train_batch_size=2, predict_with_generate=False, prediction_loss_only=False, push_to_hub=False, push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=, ray_scope=last, remove_unused_columns=False, report_to=[], resume_from_checkpoint=None, run_name=lora/baichuan2-13b-chat-v1-continue, save_on_each_node=False, save_safetensors=False, save_steps=500, save_strategy=epoch, save_total_limit=10, seed=42, sharded_ddp=[], skip_memory_metrics=True, sortish_sampler=False, tf32=None, torch_compile=False, torch_compile_backend=None, torch_compile_mode=None, torchdynamo=None, tpu_metrics_debug=False, 
tpu_num_cores=None, use_ipex=False, use_legacy_prediction_loop=False, use_mps_device=False, warmup_ratio=0.0, warmup_steps=0, weight_decay=0.0, xpu_backend=None, ) 03/02/2024 15:58:43 - INFO - main - Start Time: 2024:03:02 15:58:43 03/02/2024 15:58:43 - INFO - main - model_args:ModelArguments(model_name_or_path='models/Baichuan-13B-Chat', model_name='baichuan', cache_dir=None, use_fast_tokenizer=True, trust_remote_code=True, use_auth_token=False, model_revision='main', split_special_tokens=False, bits=4, adam8bit=False, double_quant=True, quant_type='nf4', checkpoint_dir=['lora/baichuan2-13b-iepile-lora']) data_args:DataArguments(train_file='data/kuangshan-re/train.json', valid_file='data/kuangshan-re/test.json', predict_file=None, preprocessing_num_workers=16, overwrite_cache=False, cache_path=None, template='baichuan2', system_prompt=None, max_source_length=400, max_target_length=300, cutoff_len=700, val_set_size=1000, pad_to_max_length=False, ignore_pad_token_for_loss=True, train_on_prompt=False, language='zh', id_text='input') training_args:TrainingArguments( _n_gpu=1, adafactor=False, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, auto_find_batch_size=False, bf16=True, bf16_full_eval=False, data_seed=None, dataloader_drop_last=False, dataloader_num_workers=0, dataloader_pin_memory=True, ddp_backend=None, ddp_broadcast_buffers=None, ddp_bucket_cap_mb=None, ddp_find_unused_parameters=False, ddp_timeout=1800, debug=[], deepspeed=None, disable_tqdm=False, do_eval=True, do_predict=False, do_train=True, eval_accumulation_steps=None, eval_delay=0, eval_steps=None, evaluation_strategy=epoch, fp16=False, fp16_backend=auto, fp16_full_eval=False, fp16_opt_level=O1, fsdp=[], fsdp_config={'fsdp_min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False}, fsdp_min_num_params=0, fsdp_transformer_layer_cls_to_wrap=None, full_determinism=False, generation_config=None, generation_max_length=None, generation_num_beams=None, gradient_accumulation_steps=4, gradient_checkpointing=False, greater_is_better=None, group_by_length=True, half_precision_backend=auto, hub_model_id=None, hub_private_repo=False, hub_strategy=every_save, hub_token=, ignore_data_skip=False, include_inputs_for_metrics=False, jit_mode_eval=False, label_names=None, label_smoothing_factor=0.0, learning_rate=5e-05, length_column_name=length, load_best_model_at_end=False, local_rank=0, log_level=passive, log_level_replica=warning, log_on_each_node=True, logging_dir=lora/baichuan2-13b-chat-v1-continue/runs/Mar02_15-58-43_autodl-container-59154cad17-09a5b617, logging_first_step=False, logging_nan_inf_filter=True, logging_steps=2, logging_strategy=steps, loss_scale=1.0, lr_scheduler_type=linear, max_grad_norm=0.5, max_steps=-1, metric_for_best_model=None, mp_parameters=, no_cuda=False, num_train_epochs=10.0, optim=adamw_torch, optim_args=None, output_dir=lora/baichuan2-13b-chat-v1-continue, overwrite_output_dir=True, past_index=-1, per_device_eval_batch_size=2, per_device_train_batch_size=2, predict_with_generate=False, prediction_loss_only=False, push_to_hub=False, push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=, ray_scope=last, remove_unused_columns=False, report_to=[], resume_from_checkpoint=None, run_name=lora/baichuan2-13b-chat-v1-continue, save_on_each_node=False, save_safetensors=False, save_steps=500, save_strategy=epoch, save_total_limit=10, seed=42, sharded_ddp=[], skip_memory_metrics=True, sortish_sampler=False, tf32=None, torch_compile=False, torch_compile_backend=None, 
torch_compile_mode=None, torchdynamo=None, tpu_metrics_debug=False, tpu_num_cores=None, use_ipex=False, use_legacy_prediction_loop=False, use_mps_device=False, warmup_ratio=0.0, warmup_steps=0, weight_decay=0.0, xpu_backend=None, ) finetuning_args:FinetuningArguments(dpo_beta=0.1, ppo_logger=None, ppo_score_norm=False, ppo_target=6.0, ppo_whiten_rewards=False, ref_model=None, ref_model_checkpoint=None, ref_model_quantization_bit=None, reward_model=None, reward_model_checkpoint=None, reward_model_quantization_bit=None, reward_model_type='lora', lora_r=64, lora_alpha=64.0, lora_dropout=0.05, lora_target_modules=['W_pack', 'o_proj', 'gate_proj', 'down_proj', 'up_proj'], additional_target=None, resume_lora_training=True, num_layer_trainable=3, name_module_trainable=['mlp'], stage='sft', finetuning_type='lora', upcast_layernorm=False, neft_alpha=0, export_dir=None, plot_loss=False) generating_args:GenerationArguments(max_length=512, max_new_tokens=256, min_new_tokens=None, do_sample=False, num_beams=1, num_beam_groups=1, penalty_alpha=None, use_cache=True, temperature=1.0, top_k=50, top_p=1.0, typical_p=1.0, diversity_penalty=0.0, repetition_penalty=1.0, length_penalty=1.0, no_repeat_ngram_size=0) 03/02/2024 15:58:43 - INFO - main - model_class:<class 'transformers.models.auto.modeling_auto.AutoModelForCausalLM'> tokenizer_class:<class 'transformers.models.auto.tokenization_auto.AutoTokenizer'> trainer_class:<class 'transformers.trainer.Trainer'>

[INFO|tokenization_utils_base.py:1837] 2024-03-02 15:58:43,405 >> loading file tokenizer.model [INFO|tokenization_utils_base.py:1837] 2024-03-02 15:58:43,406 >> loading file added_tokens.json [INFO|tokenization_utils_base.py:1837] 2024-03-02 15:58:43,406 >> loading file special_tokens_map.json [INFO|tokenization_utils_base.py:1837] 2024-03-02 15:58:43,406 >> loading file tokenizer_config.json Traceback (most recent call last): File "/root/autodl-tmp/IEPile/src/finetune.py", line 115, in main() File "/root/autodl-tmp/IEPile/src/finetune.py", line 110, in main train(model_args, data_args, training_args, finetuning_args, generating_args) File "/root/autodl-tmp/IEPile/src/finetune.py", line 37, in train model, tokenizer = load_model_and_tokenizer( File "/root/autodl-tmp/IEPile/src/model/loader.py", line 53, in load_model_and_tokenizer tokenizer = tokenizer_class.from_pretrained( File "/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/transformers/models/auto/tokenization_auto.py", line 689, in from_pretrained return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, kwargs) File "/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1841, in from_pretrained return cls._from_pretrained( File "/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2004, in _from_pretrained tokenizer = cls(*init_inputs, *init_kwargs) File "/root/.cache/huggingface/modules/transformers_modules/Baichuan-13B-Chat/tokenization_baichuan.py", line 59, in init self.sp_model.Load(vocab_file) File "/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/sentencepiece/init.py", line 905, in Load return self.LoadFromFile(model_file) File "/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/sentencepiece/init.py", line 310, in LoadFromFile return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg) RuntimeError: Internal: src/sentencepiece_processor.cc(1101) [model_proto->ParseFromArray(serialized.data(), serialized.size())] ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2713) of binary: /root/miniconda3/envs/IEPile/bin/python Traceback (most recent call last): File "/root/miniconda3/envs/IEPile/bin/torchrun", line 33, in sys.exit(load_entry_point('torch==2.0.0', 'console_scripts', 'torchrun')()) File "/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper return f(args, kwargs) File "/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/torch/distributed/run.py", line 794, in main run(args) File "/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run elastic_launch( File "/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in call return launch_agent(self._config, self._entrypoint, list(args)) File "/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

src/finetune.py FAILED

Failures: [1]: time : 2024-03-02_15:58:48 host : autodl-container-59154cad17-09a5b617 rank : 1 (local_rank: 1) exitcode : 1 (pid: 2714) error_file: <N/A> traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [2]: time : 2024-03-02_15:58:48 host : autodl-container-59154cad17-09a5b617 rank : 2 (local_rank: 2) exitcode : 1 (pid: 2715) error_file: <N/A> traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html [3]: time : 2024-03-02_15:58:48 host : autodl-container-59154cad17-09a5b617 rank : 3 (local_rank: 3) exitcode : 1 (pid: 2731) error_file: <N/A> traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
time : 2024-03-02_15:58:48
host : autodl-container-59154cad17-09a5b617
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 2713)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

How can I solve this?

guihonghao commented 6 months ago

RuntimeError: CUDA error: invalid device ordinal is probably a CUDA/GPU issue. In CUDA_VISIBLE_DEVICES="0,1,2,3" torchrun --nproc_per_node=4 --master_port=1287 src/finetune.py, only fill in as many GPU IDs as you actually have available.
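For instance, on a machine where only two GPUs are available, the launch line would presumably be trimmed to:

```bash
# Sketch: keep the number of IDs in CUDA_VISIBLE_DEVICES equal to --nproc_per_node;
# requesting more ranks than visible devices is what triggers "invalid device ordinal".
CUDA_VISIBLE_DEVICES="0,1" torchrun --nproc_per_node=2 --master_port=1287 src/finetune.py
```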

vv521 commented 6 months ago

> RuntimeError: CUDA error: invalid device ordinal is probably a CUDA/GPU issue. In CUDA_VISIBLE_DEVICES="0,1,2,3" torchrun --nproc_per_node=4 --master_port=1287 src/finetune.py, only fill in as many GPU IDs as you actually have available.

Great, thanks, that solved it. But with 4-bit quantization enabled on a 3090 I still hit CUDA out of memory. Would two 3090s be enough?

guihonghao commented 6 months ago

1. Lower the following parameters: --max_source_length 400, --cutoff_len 700, --max_target_length 300.
2. Lower parameters such as --per_device_train_batch_size 2, --per_device_eval_batch_size 2, and --gradient_accumulation_steps 4.
3. For the Baichuan2 model, if you run out of GPU memory when saving after evaluation, set evaluation_strategy to no.
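A hedged sketch of what those overrides might look like on two 24 GB GPUs. The specific values are illustrative only, not recommendations from the maintainers, and the other required flags (model path, data files) are omitted here.

```bash
# Sketch: memory-saving overrides for a 2x RTX 3090 setup; tune the values to your data.
CUDA_VISIBLE_DEVICES="0,1" torchrun --nproc_per_node=2 --master_port=1287 src/finetune.py \
    --bits 4 \
    --max_source_length 256 --max_target_length 200 --cutoff_len 456 \
    --per_device_train_batch_size 1 --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 2 \
    --evaluation_strategy no   # avoids the post-eval save that can OOM with Baichuan2
```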

vv521 commented 6 months ago

> 1. Lower the following parameters: --max_source_length 400, --cutoff_len 700, --max_target_length 300.
> 2. Lower parameters such as --per_device_train_batch_size 2, --per_device_eval_batch_size 2, and --gradient_accumulation_steps 4.
> 3. For the Baichuan2 model, if you run out of GPU memory when saving after evaluation, set evaluation_strategy to no.

OK, I have made those changes. After running fine_continue.bash the model loads, but then this appears:

Parameter 'function'=<function preprocess_dataset.<locals>.preprocess_supervised_dataset at 0x7eff5fbc5d30> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.
multiprocess.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/multiprocess/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 185, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/datasets/fingerprint.py", line 397, in wrapper
    out = func(self, *args, **kwargs)
  File "/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 2006, in _map_single
    batch = apply_function_on_filtered_inputs(
  File "/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 1896, in apply_function_on_filtered_inputs
    function(*fn_args, *effective_indices, **fn_kwargs) if with_indices else function(*fn_args, **fn_kwargs)
  File "/root/autodl-tmp/IEPile/src/datamodule/preprocess.py", line 52, in preprocess_supervised_dataset
    for query, response, history, system in construct_example(examples):
  File "/root/autodl-tmp/IEPile/src/datamodule/preprocess.py", line 28, in construct_example
    query, response = examples["prompt"][i], examples["response"][i]
KeyError: 'response'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/autodl-tmp/IEPile/src/finetune.py", line 115, in <module>
    main()
  File "/root/autodl-tmp/IEPile/src/finetune.py", line 110, in main
    train(model_args, data_args, training_args, finetuning_args, generating_args)
  File "/root/autodl-tmp/IEPile/src/finetune.py", line 49, in train
    train_data, valid_data = process_datasets(
  File "/root/autodl-tmp/IEPile/src/datamodule/get_datasets.py", line 46, in process_datasets
    valid_data = preprocess_dataset(valid_data, tokenizer, data_args, training_args, stage=finetuning_args.stage)
  File "/root/autodl-tmp/IEPile/src/datamodule/preprocess.py", line 184, in preprocess_dataset
    dataset = dataset.map(
  File "/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 1736, in map
    transformed_shards = [r.get() for r in results]
  File "/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 1736, in <listcomp>
    transformed_shards = [r.get() for r in results]
  File "/root/miniconda3/envs/IEPile/lib/python3.9/site-packages/multiprocess/pool.py", line 771, in get
    raise self._value
KeyError: 'response'

How can I resolve this error?

guihonghao commented 6 months ago

Check what a single record in your training file looks like, and make sure it has (1) an instruction field and (2) an output field.
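For reference, a hypothetical example of one training record. Only the instruction and output keys are confirmed by the comment above; the record content and any additional keys are assumptions, so check the IEPile README for the exact schema. The file path is the one that appears in the logs earlier in this thread.

```bash
# Append one hypothetical JSON-lines record (one object per line) to the training file used above.
# The "..." placeholders stand for your own task prompt and gold answer.
cat >> data/kuangshan-re/train.json <<'EOF'
{"instruction": "Extract the schema-defined relation triples from the input text: ...", "output": "..."}
EOF
```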

zxlzr commented 6 months ago

Has your issue been resolved?