ymcui / Chinese-LLaMA-Alpaca-2

中文LLaMA-2 & Alpaca-2大模型二期项目 + 64K超长上下文模型 (Chinese LLaMA-2 & Alpaca-2 LLMs with 64K long context models)
Apache License 2.0

The model's performance is poor when using the merged tokenizer. #540

Closed adam-mhd94 closed 2 months ago

adam-mhd94 commented 3 months ago

Check before submitting issues

Type of Issue

Model training and fine-tuning

Base Model

Chinese-LLaMA-2 (7B/13B)

Operating System

Linux

Describe your issue in detail

I intend to fine-tune the LLaMA-2 7B model with non-Chinese data. Training the model on a large dataset with the original LLaMA tokenizer yields good results. However, when I use a tokenizer tailored to my language, the loss increases significantly and the model performs very poorly; for example, it keeps repeating a single word or character.

GPUs: 6 × 16GB T4. I am training the model in multi-GPU mode.
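One sanity check for this kind of setup (a sketch only, with placeholder paths that are not files from this repository) is to confirm that the merged tokenizer's vocabulary size matches the number of rows in the model's embedding matrix; a mismatch means the new tokens start from untrained embeddings, which often shows up as degenerate, repetitive output.

# Sketch only: compare the merged tokenizer's vocab size with the model's embedding rows.
# Both paths below are placeholders.
python - <<'EOF'
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("/path/to/merged_tokenizer")     # placeholder
model = AutoModelForCausalLM.from_pretrained("/path/to/base_llama_model")  # placeholder

print("tokenizer vocab size:", len(tokenizer))
print("model embedding rows:", model.get_input_embeddings().weight.shape[0])
# If these numbers differ, the new tokens have no trained embeddings yet, which is why
# embed_tokens and lm_head are listed in modules_to_save and must be trained.
EOF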

Read the wiki (https://github.com/ymcui/Chinese-LLaMA-Alpaca-2/wiki/pt_scripts_zh) carefully before running the script.

lr=2e-4
lora_rank=64
lora_alpha=128
lora_trainable="q_proj,v_proj,k_proj,o_proj,gate_proj,down_proj,up_proj"
modules_to_save="embed_tokens,lm_head"
lora_dropout=0.05

per_device_train_batch_size=1
gradient_accumulation_steps=1
block_size=32

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 torchrun --nnodes 1 --nproc_per_node 6 --master_port 5896 run_clm_pt_with_peft.py \
    --deepspeed ${deepspeed_config_file} \
    --model_name_or_path ${pretrained_model} \
    --tokenizer_name_or_path ${pretrained_model} \
    --dataset_dir ${dataset_dir} \
    --data_cache_dir ${data_cache} \
    --validation_split_percentage 0.001 \
    --per_device_train_batch_size ${per_device_train_batch_size} \
    --do_train \
    --seed $RANDOM \
    --num_train_epochs 1 \
    --lr_scheduler_type cosine \
    --learning_rate ${lr} \
    --warmup_ratio 0.05 \
    --weight_decay 0.01 \
    --logging_strategy steps \
    --logging_steps 10 \
    --save_strategy steps \
    --save_total_limit 2 \
    --save_steps 200 \
    --gradient_accumulation_steps ${gradient_accumulation_steps} \
    --preprocessing_num_workers 16 \
    --block_size ${block_size} \
    --output_dir ${output_dir} \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --lora_rank ${lora_rank} \
    --lora_alpha ${lora_alpha} \
    --trainable ${lora_trainable} \
    --lora_dropout ${lora_dropout} \
    --modules_to_save ${modules_to_save} \
    --torch_dtype float32 \
    --load_in_kbits 8 \
    --save_safetensors False \
    --gradient_checkpointing \
    --ddp_find_unused_parameters False \
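A configuration detail worth double-checking (a hedged note, assuming the merged tokenizer lives in its own directory rather than inside the base model folder): the command above passes ${pretrained_model} to both --model_name_or_path and --tokenizer_name_or_path, so the extended tokenizer is only used if it actually sits at that path.

# Sketch, assuming the merged tokenizer is stored separately from the base model weights;
# both paths are placeholders.
pretrained_model=/path/to/chinese-llama-2-7b
merged_tokenizer=/path/to/merged_tokenizer
# Then pass --tokenizer_name_or_path ${merged_tokenizer} (instead of ${pretrained_model})
# to run_clm_pt_with_peft.py so training uses the extended vocabulary.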

Dependencies (must be provided for code-related issues)

accelerate==0.27.2 aiofiles==23.2.1 aiohttp==3.9.3 aiosignal==1.3.1 altair==5.2.0 anyio==4.3.0 appdirs==1.4.4 async-timeout==4.0.3 attrs==23.2.0 bitsandbytes==0.41.1 certifi==2024.2.2 charset-normalizer==3.3.2 click==8.1.7 contourpy==1.2.0 cycler==0.12.1 datasets==2.14.5 deepspeed==0.11.0 dill==0.3.7 docker-pycreds==0.4.0 exceptiongroup==1.2.0 fastapi==0.109.2 ffmpy==0.3.2 filelock==3.13.1 fire==0.5.0 fonttools==4.49.0 frozenlist==1.4.1 fsspec==2023.6.0 gitdb==4.0.11 GitPython==3.1.42 gradio==3.50.2 gradio_client==0.6.1 h11==0.14.0 hjson==3.1.0 httpcore==1.0.3 httpx==0.26.0 huggingface-hub==0.17.3 idna==3.6 importlib-resources==6.1.1 Jinja2==3.1.2 joblib==1.3.2 jsonschema==4.21.1 jsonschema-specifications==2023.12.1 kiwisolver==1.4.5 MarkupSafe==2.1.3 matplotlib==3.8.3 mpmath==1.3.0 multidict==6.0.5 multiprocess==0.70.15 networkx==3.2.1 ninja==1.11.1.1 numpy==1.26.4 nvidia-cublas-cu11==11.11.3.6 nvidia-cuda-cupti-cu11==11.8.87 nvidia-cuda-nvrtc-cu11==11.8.89 nvidia-cuda-runtime-cu11==11.8.89 nvidia-cudnn-cu11==8.7.0.84 nvidia-cufft-cu11==10.9.0.58 nvidia-curand-cu11==10.3.0.86 nvidia-cusolver-cu11==11.4.1.48 nvidia-cusparse-cu11==11.7.5.86 nvidia-nccl-cu11==2.19.3 nvidia-nvtx-cu11==11.8.86 orjson==3.9.14 packaging==23.2 pandas==2.2.0 pathtools==0.1.2 peft==0.3.0 pillow==10.2.0 protobuf==4.25.3 psutil==5.9.8 py-cpuinfo==9.0.0 pyarrow==15.0.0 pydantic==1.10.14 pydub==0.25.1 pyparsing==3.1.1 python-dateutil==2.8.2 python-multipart==0.0.9 pytz==2024.1 PyYAML==6.0.1 referencing==0.33.0 regex==2023.12.25 requests==2.31.0 rpds-py==0.18.0 safetensors==0.4.2 scikit-learn==1.4.1.post1 scipy==1.11.1 semantic-version==2.10.0 sentencepiece==0.1.99 sentry-sdk==1.40.5 setproctitle==1.3.3 six==1.16.0 smmap==5.0.1 sniffio==1.3.0 starlette==0.36.3 sympy==1.12 termcolor==2.4.0 threadpoolctl==3.3.0 tokenizers==0.14.1 toolz==0.12.1 torch==2.2.0+cu118 torchaudio==2.2.0+cu118 torchvision==0.17.0+cu118 tqdm==4.66.2 transformers==4.34.0 triton==2.2.0 typing_extensions==4.9.0 tzdata==2024.1 urllib3==2.2.1 uvicorn==0.27.1 wandb==0.15.12 websockets==11.0.3 xxhash==3.4.1 yarl==1.9.4

Execution logs or screenshots

The model's output continuously repeats a single word and is completely meaningless. Do you know where the problem might be coming from?

iMountTai commented 3 months ago

This may be a case of underfitting.

github-actions[bot] commented 3 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your consideration.

adam-mhd94 commented 3 months ago

> This may be a case of underfitting.

Thank you. Due to the 16GB memory per GPU, I cannot increase the batch size. Could the issue be caused by a very small batch size?
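For reference, one way to raise the effective batch size without using more GPU memory is gradient accumulation; the following is only a sketch of adjusted hyperparameters, not a tested configuration:

# Sketch: with 6 GPUs, effective batch = per_device_batch_size * grad_accum_steps * num_gpus.
per_device_train_batch_size=1
gradient_accumulation_steps=16   # effective batch of 1 * 16 * 6 = 96 sequences per update
block_size=512                   # blocks larger than 32 are commonly used for causal LM pre-training

The extra accumulation steps cost training time rather than per-GPU memory, since gradients are summed across micro-batches before each optimizer update.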

github-actions[bot] commented 2 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your consideration.

github-actions[bot] commented 2 months ago

Closing the issue, since no updates observed. Feel free to re-open if you need any further assistance.