ymcui / Chinese-LLaMA-Alpaca-2

Chinese LLaMA-2 & Alpaca-2 LLMs (phase-2 project) with 64K long-context models
Apache License 2.0

chinese-alpaca-2-13b-16k: SFT appears to succeed, but prompts taken from the training set still don't give the expected results #312

Closed swilly0906 closed 1 year ago

swilly0906 commented 1 year ago

Required checks before submitting

Issue type

Model training and fine-tuning

Base model

Chinese-Alpaca-2-16K (7B/13B)

Operating system

Linux

Detailed description of the problem

# The "Linux" above actually means I'm running in a Google Colab environment
# Paste the code you ran here (inside this code block)
!pip install transformers==4.31.0
!pip install sentencepiece==0.1.97
!pip install bitsandbytes
!pip install accelerate
!pip install peft
!pip install torch
!pip install datasets

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import accelerate, bitsandbytes  # imported only to confirm they are installed; transformers uses them internally

# Load the base model in fp16, sharded automatically across the available devices
model_name = "ziqingyang/chinese-alpaca-2-13b-16k"
cache_folder = "/content/model"
model = AutoModelForCausalLM.from_pretrained(model_name, device_map='auto', torch_dtype=torch.float16, cache_dir=cache_folder)
model = model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False, cache_dir=cache_folder)
tokenizer.pad_token = tokenizer.eos_token

######## Parameters ########
lr=1e-4
lora_rank=64
lora_alpha=128
lora_trainable="q_proj,v_proj,k_proj,o_proj,gate_proj,down_proj,up_proj"
modules_to_save="embed_tokens,lm_head"
lora_dropout=0.05

pretrained_model="/content/model/chinese-alpaca-2-13b-16k"#path/to/hf/llama-2/or/merged/llama-2/dir/or/model_id
chinese_tokenizer_path="/content/model/chinese-alpaca-2-13b-16k"#path/to/chinese/llama-2/tokenizer/dir
dataset_dir="/content/train_folder/train_data_chat.json"#path/to/sft/data/dir
per_device_train_batch_size=1
per_device_eval_batch_size=1
gradient_accumulation_steps=1
output_dir="/content/output_folder/"#output_dir
#peft_model="/content/Chinese-LLaMA-Alpaca-2/scripts/training/peft/peft_model.py"#path/to/peft/model/dir
validation_file="/content/vaild_folder/valid_data_chat.json"#validation_file_name
max_seq_length=512

deepspeed_config_file="/content/Chinese-LLaMA-Alpaca-2/scripts/training/ds_zero2_no_offload.json"#ds_zero2_no_offload.json

######## Launch command ########
!python /content/Chinese-LLaMA-Alpaca-2/scripts/training/run_clm_sft_with_peft.py \
    --model_name_or_path ${pretrained_model} \
    --tokenizer_name_or_path ${chinese_tokenizer_path} \
    --dataset_dir ${dataset_dir} \
    --per_device_train_batch_size ${per_device_train_batch_size} \
    --per_device_eval_batch_size ${per_device_eval_batch_size} \
    --do_train \
    --do_eval \
    --seed $RANDOM \
    --fp16 \
    --num_train_epochs 2 \
    --lr_scheduler_type cosine \
    --learning_rate ${lr} \
    --warmup_ratio 0.03 \
    --weight_decay 0 \
    --logging_strategy steps \
    --logging_steps 10 \
    --save_strategy steps \
    --save_total_limit 3 \
    --evaluation_strategy steps \
    --eval_steps 250 \
    --save_steps 500 \
    --gradient_accumulation_steps ${gradient_accumulation_steps} \
    --preprocessing_num_workers 8 \
    --max_seq_length ${max_seq_length} \
    --output_dir ${output_dir} \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --lora_rank ${lora_rank} \
    --lora_alpha ${lora_alpha} \
    --trainable ${lora_trainable} \
    --modules_to_save ${modules_to_save} \
    --lora_dropout ${lora_dropout} \
    --torch_dtype float16 \
    --validation_file ${validation_file} \
    --load_in_kbits

Dependency information (required for code-related issues)

# Paste dependency information here (inside this code block)
# The output below is what running the code above produced
# The title says SFT seemed fine because Colab ran the cell without popping up an error that blocked execution

2023-09-24 04:39:00.973546: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT

usage: run_clm_sft_with_peft.py [-h] [--model_name_or_path MODEL_NAME_OR_PATH] [--tokenizer_name_or_path TOKENIZER_NAME_OR_PATH] [--config_overrides CONFIG_OVERRIDES] [--config_name CONFIG_NAME] [--tokenizer_name TOKENIZER_NAME] [--cache_dir CACHE_DIR] [--use_fast_tokenizer [USE_FAST_TOKENIZER]] [--no_use_fast_tokenizer] [--model_revision MODEL_REVISION] [--use_auth_token [USE_AUTH_TOKEN]] [--torch_dtype {auto,bfloat16,float16,float32}] [--dataset_dir DATASET_DIR] [--train_file TRAIN_FILE] [--validation_file VALIDATION_FILE] [--overwrite_cache [OVERWRITE_CACHE]] [--validation_split_percentage VALIDATION_SPLIT_PERCENTAGE] [--preprocessing_num_workers PREPROCESSING_NUM_WORKERS] [--keep_linebreaks [KEEP_LINEBREAKS]] [--no_keep_linebreaks] [--data_cache_dir DATA_CACHE_DIR] [--max_seq_length MAX_SEQ_LENGTH] --output_dir OUTPUT_DIR [--overwrite_output_dir [OVERWRITE_OUTPUT_DIR]] [--do_train [DO_TRAIN]] [--do_eval [DO_EVAL]] [--do_predict [DO_PREDICT]] [--evaluation_strategy {no,steps,epoch}] [--prediction_loss_only [PREDICTION_LOSS_ONLY]] [--per_device_train_batch_size PER_DEVICE_TRAIN_BATCH_SIZE] [--per_device_eval_batch_size PER_DEVICE_EVAL_BATCH_SIZE] [--per_gpu_train_batch_size PER_GPU_TRAIN_BATCH_SIZE] [--per_gpu_eval_batch_size PER_GPU_EVAL_BATCH_SIZE] [--gradient_accumulation_steps GRADIENT_ACCUMULATION_STEPS] [--eval_accumulation_steps EVAL_ACCUMULATION_STEPS] [--eval_delay EVAL_DELAY] [--learning_rate LEARNING_RATE] [--weight_decay WEIGHT_DECAY] [--adam_beta1 ADAM_BETA1] [--adam_beta2 ADAM_BETA2] [--adam_epsilon ADAM_EPSILON] [--max_grad_norm MAX_GRAD_NORM] [--num_train_epochs NUM_TRAIN_EPOCHS] [--max_steps MAX_STEPS] [--lr_scheduler_type {linear,cosine,cosine_with_restarts,polynomial,constant,constant_with_warmup,inverse_sqrt,reduce_lr_on_plateau}] [--warmup_ratio WARMUP_RATIO] [--warmup_steps WARMUP_STEPS] [--log_level {debug,info,warning,error,critical,passive}] [--log_level_replica {debug,info,warning,error,critical,passive}] [--log_on_each_node [LOG_ON_EACH_NODE]] [--no_log_on_each_node] [--logging_dir LOGGING_DIR] [--logging_strategy {no,steps,epoch}] [--logging_first_step [LOGGING_FIRST_STEP]] [--logging_steps LOGGING_STEPS] [--logging_nan_inf_filter [LOGGING_NAN_INF_FILTER]] [--no_logging_nan_inf_filter] [--save_strategy {no,steps,epoch}] [--save_steps SAVE_STEPS] [--save_total_limit SAVE_TOTAL_LIMIT] [--save_safetensors [SAVE_SAFETENSORS]] [--save_on_each_node [SAVE_ON_EACH_NODE]] [--no_cuda [NO_CUDA]] [--use_mps_device [USE_MPS_DEVICE]] [--seed SEED] [--data_seed DATA_SEED] [--jit_mode_eval [JIT_MODE_EVAL]] [--use_ipex [USE_IPEX]] [--bf16 [BF16]] [--fp16 [FP16]] [--fp16_opt_level FP16_OPT_LEVEL] [--half_precision_backend {auto,cuda_amp,apex,cpu_amp}] [--bf16_full_eval [BF16_FULL_EVAL]] [--fp16_full_eval [FP16_FULL_EVAL]] [--tf32 TF32] [--local_rank LOCAL_RANK] [--ddp_backend {nccl,gloo,mpi,ccl}] [--tpu_num_cores TPU_NUM_CORES] [--tpu_metrics_debug [TPU_METRICS_DEBUG]] [--debug DEBUG [DEBUG ...]] [--dataloader_drop_last [DATALOADER_DROP_LAST]] [--eval_steps EVAL_STEPS] [--dataloader_num_workers DATALOADER_NUM_WORKERS] [--past_index PAST_INDEX] [--run_name RUN_NAME] [--disable_tqdm DISABLE_TQDM] [--remove_unused_columns [REMOVE_UNUSED_COLUMNS]] [--no_remove_unused_columns] [--label_names LABEL_NAMES [LABEL_NAMES ...]] [--load_best_model_at_end [LOAD_BEST_MODEL_AT_END]] [--metric_for_best_model METRIC_FOR_BEST_MODEL] [--greater_is_better GREATER_IS_BETTER]
[--ignore_data_skip [IGNORE_DATA_SKIP]] [--sharded_ddp SHARDED_DDP] [--fsdp FSDP] [--fsdp_min_num_params FSDP_MIN_NUM_PARAMS] [--fsdp_config FSDP_CONFIG] [--fsdp_transformer_layer_cls_to_wrap FSDP_TRANSFORMER_LAYER_CLS_TO_WRAP] [--deepspeed DEEPSPEED] [--label_smoothing_factor LABEL_SMOOTHING_FACTOR] [--optim {adamw_hf,adamw_torch,adamw_torch_fused,adamw_torch_xla,adamw_apex_fused,adafactor,adamw_anyprecision,sgd,adagrad,adamw_bnb_8bit,adamw_8bit,lion_8bit,lion_32bit,paged_adamw_32bit,paged_adamw_8bit,paged_lion_32bit,paged_lion_8bit}] [--optim_args OPTIM_ARGS] [--adafactor [ADAFACTOR]] [--group_by_length [GROUP_BY_LENGTH]] [--length_column_name LENGTH_COLUMN_NAME] [--report_to REPORT_TO [REPORT_TO ...]] [--ddp_find_unused_parameters DDP_FIND_UNUSED_PARAMETERS] [--ddp_bucket_cap_mb DDP_BUCKET_CAP_MB] [--ddp_broadcast_buffers DDP_BROADCAST_BUFFERS] [--dataloader_pin_memory [DATALOADER_PIN_MEMORY]] [--no_dataloader_pin_memory] [--skip_memory_metrics [SKIP_MEMORY_METRICS]] [--no_skip_memory_metrics] [--use_legacy_prediction_loop [USE_LEGACY_PREDICTION_LOOP]] [--push_to_hub [PUSH_TO_HUB]] [--resume_from_checkpoint RESUME_FROM_CHECKPOINT] [--hub_model_id HUB_MODEL_ID] [--hub_strategy {end,every_save,checkpoint,all_checkpoints}] [--hub_token HUB_TOKEN] [--hub_private_repo [HUB_PRIVATE_REPO]] [--gradient_checkpointing [GRADIENT_CHECKPOINTING]] [--include_inputs_for_metrics [INCLUDE_INPUTS_FOR_METRICS]] [--fp16_backend {auto,cuda_amp,apex,cpu_amp}] [--push_to_hub_model_id PUSH_TO_HUB_MODEL_ID] [--push_to_hub_organization PUSH_TO_HUB_ORGANIZATION] [--push_to_hub_token PUSH_TO_HUB_TOKEN] [--mp_parameters MP_PARAMETERS] [--auto_find_batch_size [AUTO_FIND_BATCH_SIZE]] [--full_determinism [FULL_DETERMINISM]] [--torchdynamo TORCHDYNAMO] [--ray_scope RAY_SCOPE] [--ddp_timeout DDP_TIMEOUT] [--torch_compile [TORCH_COMPILE]] [--torch_compile_backend TORCH_COMPILE_BACKEND] [--torch_compile_mode TORCH_COMPILE_MODE] [--xpu_backend {mpi,ccl,gloo}] [--trainable TRAINABLE] [--lora_rank LORA_RANK] [--lora_dropout LORA_DROPOUT] [--lora_alpha LORA_ALPHA] [--modules_to_save MODULES_TO_SAVE] [--peft_path PEFT_PATH] [--flash_attn [FLASH_ATTN]] [--double_quant [DOUBLE_QUANT]] [--no_double_quant] [--quant_type QUANT_TYPE] [--load_in_kbits LOAD_IN_KBITS]

run_clm_sft_with_peft.py: error: argument --model_name_or_path: expected one argument
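The usage dump above ends with run_clm_sft_with_peft.py: error: argument --model_name_or_path: expected one argument, i.e. argparse never received a value for that flag, which suggests the ${...} placeholders in the !python cell were not expanded with the notebook's Python variables (in Colab/IPython, {var} interpolation inside a ! line is one alternative). Below is a minimal sketch of passing the values explicitly from Python; only a subset of the original flags is shown, and the remaining ones can be appended the same way:

import subprocess

# Sketch (assuming the variables defined in the parameter cell above): build the
# command as a Python list so every value is passed through explicitly instead of
# relying on ${...} expansion inside a notebook `!` line.
cmd = [
    "python", "/content/Chinese-LLaMA-Alpaca-2/scripts/training/run_clm_sft_with_peft.py",
    "--model_name_or_path", pretrained_model,
    "--tokenizer_name_or_path", chinese_tokenizer_path,
    "--dataset_dir", dataset_dir,
    "--validation_file", validation_file,
    "--output_dir", output_dir,
    "--per_device_train_batch_size", str(per_device_train_batch_size),
    "--gradient_accumulation_steps", str(gradient_accumulation_steps),
    "--max_seq_length", str(max_seq_length),
    "--learning_rate", str(lr),
    "--lora_rank", str(lora_rank),
    "--lora_alpha", str(lora_alpha),
    "--trainable", lora_trainable,
    "--modules_to_save", modules_to_save,
    "--lora_dropout", str(lora_dropout),
    "--do_train", "--do_eval", "--fp16",
    "--num_train_epochs", "2",
    "--torch_dtype", "float16",
    "--load_in_kbits", "16",  # per the usage text, --load_in_kbits expects a value; 16 is assumed here
]
subprocess.run(cmd, check=True)  # check=True raises on a non-zero exit, so failures are visible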

Run log or screenshots

# Paste the run log here (inside this code block)

Sorry, I'm not from a CS background. I've read through many issues but still can't find the problem. My training data train_data_chat.json follows the format provided by ymcui, and its contents look roughly like this:

[
  {
    "instruction": "You are a helpful assistant. 你是Willy行銷公司資料庫的MYSQL工程師。",
    "input": "請用SQL查詢消費者在日本,且出生日期為1994年5月23日,或1990年12月11日",
    "output": "SELECT CUSTOM,BIRTH FROM taiwan_willy_tr20 where birth in ('19940523','19901211') and loc='03'"
  },
  ...
]

The file holds roughly 20 instruction examples. My tasks are mostly SQL; only the first input/output pair is the schema of the taiwan_willy_tr20 table, and the remaining examples look much like the one above. The validation data valid_data_chat also has about 20 examples. (Could it be that I have too few instructions, so the model ignores them?) Finally, I ran:

input_ids = tokenizer(['You are a helpful assistant. 你是Willy行銷公司資料庫的MYSQL工程師。請用SQL查詢消費者在日本,且出生日期為1994年5月23日,或1990年12月11日 '], return_tensors="pt",add_special_tokens=False).input_ids.to('cuda')
generate_input = {
    "input_ids":input_ids,
    "max_new_tokens":512,
    "do_sample":True,
    "top_k":50,
    "top_p":0.95,
    "temperature":0.3,
    "repetition_penalty":1.3,
    "eos_token_id":tokenizer.eos_token_id,
    "bos_token_id":tokenizer.bos_token_id,
    "pad_token_id":tokenizer.pad_token_id
}
generate_ids = model.generate(**generate_input)
text = tokenizer.decode(generate_ids[0])
print(text)

The output is SQL hallucinated by the model: SQL does get generated, but the table name is completely wrong. So I don't know which part of the commands above is the problem?
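For comparison, here is a minimal sketch of wrapping the same query in the Llama-2-style chat prompt template that Chinese-Alpaca-2 models are tuned with; the exact template string and system-prompt placement below are assumptions modeled on the repo's inference examples:

# Sketch: build the prompt with an Alpaca-2/Llama-2 chat template before tokenizing.
# The template string is an assumption, not taken from the run above.
TEMPLATE = (
    "[INST] <<SYS>>\n"
    "{system_prompt}\n"
    "<</SYS>>\n\n"
    "{instruction} [/INST]"
)

system_prompt = "You are a helpful assistant. 你是Willy行銷公司資料庫的MYSQL工程師。"
instruction = "請用SQL查詢消費者在日本,且出生日期為1994年5月23日,或1990年12月11日"
prompt = TEMPLATE.format(system_prompt=system_prompt, instruction=instruction)

input_ids = tokenizer([prompt], return_tensors="pt").input_ids.to('cuda')
generate_ids = model.generate(input_ids=input_ids, max_new_tokens=512, do_sample=True,
                              top_k=50, top_p=0.95, temperature=0.3, repetition_penalty=1.3,
                              eos_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(generate_ids[0], skip_special_tokens=True))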

Or do I have to download the Chinese-Alpaca-2-LoRA-13B-16K files and fine-tune on those instead? I'm also puzzled that most people in the issues fine-tune with LoRA and then merge, while hardly anyone seems to SFT the full model directly the way I did.
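On the LoRA point, a minimal sketch (assuming the peft package installed above) of how a LoRA adapter written by run_clm_sft_with_peft.py could be attached to the base model loaded earlier and merged for inference; the adapter directory name below is hypothetical and depends on where the training run actually wrote its output:

from peft import PeftModel

# Hypothetical adapter location under the output_dir used above; adjust to the
# directory the SFT run actually produced (e.g. a checkpoint-* folder).
adapter_path = "/content/output_folder/sft_lora_model"

model = PeftModel.from_pretrained(model, adapter_path)  # attach the trained LoRA weights
model = model.merge_and_unload()                        # fold the LoRA deltas into the base weights
model = model.eval()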

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your consideration.

github-actions[bot] commented 1 year ago

Closing the issue, since no updates observed. Feel free to re-open if you need any further assistance.