Chinese LLaMA-2 & Alpaca-2 LLMs (phase 2 project) with 64K long context models
The model's performance is poor when using the merged tokenizer. #540

Closed by adam-mhd94 2 months ago

adam-mhd94 commented 3 months ago

Type of Issue

Model training and fine-tuning

Base Model

Chinese-LLaMA-2 (7B/13B)

Operating System


Describe your issue in detail

I intend to fine-tune the LLaMA 7B model on non-Chinese data. Training the model on a large dataset with the original LLaMA tokenizer yields good results. However, when I use a tokenizer tailored to my language, the loss increases significantly and the model performs very poorly. For example, it keeps repeating a single word or character.
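One possible contributor to the higher loss with a custom tokenizer: every token added beyond the original vocabulary needs a new row in both `embed_tokens` and `lm_head`, and those rows start out untrained. The sketch below gives a rough count, assuming the original LLaMA vocabulary of 32,000 tokens and the 7B hidden size of 4,096. `merged_vocab_size` is a hypothetical placeholder, since the thread does not state the size of the custom tokenizer.

```python
# Rough count of untrained parameters introduced by extending the vocabulary.
ORIGINAL_VOCAB_SIZE = 32_000  # original LLaMA / LLaMA-2 tokenizer
HIDDEN_SIZE = 4_096           # LLaMA-2 7B hidden dimension


def new_embedding_params(merged_vocab_size: int,
                         original_vocab_size: int = ORIGINAL_VOCAB_SIZE,
                         hidden_size: int = HIDDEN_SIZE) -> int:
    """Parameters in the new rows of embed_tokens and lm_head combined."""
    new_rows = max(merged_vocab_size - original_vocab_size, 0)
    # One new row in the input embedding and one in the output projection.
    return 2 * new_rows * hidden_size


if __name__ == "__main__":
    merged_vocab_size = 50_000  # hypothetical value, not from the thread
    print(new_embedding_params(merged_vocab_size))
```

The larger this number is relative to the amount of training data seen, the longer the new rows may take to become useful.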

GPUs: 6 × 16 GB T4. I am training the model in multi-GPU mode.


# Read the wiki (https://github.com/ymcui/Chinese-LLaMA-Alpaca-2/wiki/pt_scripts_zh) carefully before running the script

lr=2e-4
lora_rank=64
lora_alpha=128
lora_trainable="q_proj,v_proj,k_proj,o_proj,gate_proj,down_proj,up_proj"
modules_to_save="embed_tokens,lm_head"
lora_dropout=0.05

per_device_train_batch_size=1
gradient_accumulation_steps=1
block_size=32

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 torchrun --nnodes 1 --nproc_per_node 6 --master_port 5896 run_clm_pt_with_peft.py \
    --deepspeed ${deepspeed_config_file} \
    --model_name_or_path ${pretrained_model} \
    --tokenizer_name_or_path ${pretrained_model} \
    --dataset_dir ${dataset_dir} \
    --data_cache_dir ${data_cache} \
    --validation_split_percentage 0.001 \
    --per_device_train_batch_size ${per_device_train_batch_size} \
    --do_train \
    --seed $RANDOM \
    --num_train_epochs 1 \
    --lr_scheduler_type cosine \
    --learning_rate ${lr} \
    --warmup_ratio 0.05 \
    --weight_decay 0.01 \
    --logging_strategy steps \
    --logging_steps 10 \
    --save_strategy steps \
    --save_total_limit 2 \
    --save_steps 200 \
    --gradient_accumulation_steps ${gradient_accumulation_steps} \
    --preprocessing_num_workers 16 \
    --block_size ${block_size} \
    --output_dir ${output_dir} \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --lora_rank ${lora_rank} \
    --lora_alpha ${lora_alpha} \
    --trainable ${lora_trainable} \
    --lora_dropout ${lora_dropout} \
    --modules_to_save ${modules_to_save} \
    --torch_dtype float32 \
    --load_in_kbits 8 \
    --save_safetensors False \
    --gradient_checkpointing \
    --ddp_find_unused_parameters False
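For reference, the amount of data consumed per optimizer step under these settings can be worked out with simple arithmetic. The sketch below assumes plain data parallelism, where each of the 6 processes handles its own micro-batch. The function name is illustrative and not part of the training script.

```python
def tokens_per_optimizer_step(num_gpus: int,
                              per_device_train_batch_size: int,
                              gradient_accumulation_steps: int,
                              block_size: int) -> tuple[int, int]:
    """Return (sequences, tokens) contributing to one optimizer update."""
    sequences = (num_gpus
                 * per_device_train_batch_size
                 * gradient_accumulation_steps)
    # Each sequence is a chunk of block_size tokens.
    return sequences, sequences * block_size


if __name__ == "__main__":
    # Values taken from the script above: 6 GPUs, batch size 1,
    # no accumulation, block_size 32.
    sequences, tokens = tokens_per_optimizer_step(6, 1, 1, 32)
    print(f"{sequences} sequences, {tokens} tokens per optimizer step")
```

With the posted values this comes to 6 sequences of 32 tokens each, so every update is based on 192 tokens.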

Dependencies (must be provided for code-related issues)

Execution logs or screenshots

The model's output continuously repeats a single word and is completely meaningless. Do you know where the problem might be coming from?

iMountTai commented 3 months ago

The model may be underfitting.

adam-mhd94 commented 3 months ago

Thank you. Because each GPU has only 16 GB of memory, I cannot increase the batch size. Could the issue be caused by the very small batch size?
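On the memory constraint: the script already exposes `gradient_accumulation_steps`, which raises the effective batch size without holding a larger batch on the GPU at once. The toy sketch below, using a one-parameter least-squares model with made-up data, illustrates why this works: averaging the gradients of equally sized micro-batches gives the same gradient as one large batch. Whether a larger effective batch would fix the repetition reported here is not established in this thread.

```python
import math


def grad_mse(w: float, batch: list[tuple[float, float]]) -> float:
    """Gradient of the mean squared error for y ~ w * x over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w: float,
                     micro_batches: list[list[tuple[float, float]]]) -> float:
    """Average the gradients of equally sized micro-batches."""
    return sum(grad_mse(w, mb) for mb in micro_batches) / len(micro_batches)


if __name__ == "__main__":
    # Made-up (x, y) samples, for illustration only.
    data = [(1.0, 2.0), (2.0, 3.5), (3.0, 6.5), (4.0, 8.0)]
    w = 0.5

    full_batch = grad_mse(w, data)                          # one batch of 4
    accumulated = accumulated_grad(w, [[s] for s in data])  # 4 batches of 1

    print(math.isclose(full_batch, accumulated))
```

The equivalence holds for the gradient itself; wall-clock time per update still grows with the number of accumulation steps.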

