ymcui / Chinese-LLaMA-Alpaca-2

中文LLaMA-2 & Alpaca-2大模型二期项目 + 64K超长上下文模型 (Chinese LLaMA-2 & Alpaca-2 LLMs with 64K long context models)
Apache License 2.0

The model's performance is poor when using the merged tokenizer. #540

Closed adam-mhd94 closed 2 months ago

adam-mhd94 commented 3 months ago

Check before submitting issues

Type of Issue

Model training and fine-tuning

Base Model

Chinese-LLaMA-2 (7B/13B)

Operating System

Linux

Describe your issue in detail

I intend to fine-tune the LLaMA-2 7B model with non-Chinese data. Training the model on a large dataset with the original LLaMA tokenizer yields good results. However, when I use a tokenizer tailored to my language, the loss increases significantly and the model performs very poorly; for example, it keeps repeating a single word or character.

GPUs: 6 × 16GB T4. I am training the model in multi-GPU mode.
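One sanity check for this kind of setup (a sketch only, with placeholder paths that are not files from this repository) is to confirm that the merged tokenizer's vocabulary size matches the number of rows in the model's embedding matrix; a mismatch means the new tokens start from untrained embeddings, which often shows up as degenerate, repetitive output.

# Sketch only: compare the merged tokenizer's vocab size with the model's embedding rows.
# Both paths below are placeholders.
python - <<'EOF'
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("/path/to/merged_tokenizer")     # placeholder
model = AutoModelForCausalLM.from_pretrained("/path/to/base_llama_model")  # placeholder

print("tokenizer vocab size:", len(tokenizer))
print("model embedding rows:", model.get_input_embeddings().weight.shape[0])
# If these numbers differ, the new tokens have no trained embeddings yet, which is why
# embed_tokens and lm_head are listed in modules_to_save and must be trained.
EOF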

Read the wiki (https://github.com/ymcui/Chinese-LLaMA-Alpaca-2/wiki/pt_scripts_zh) carefully before running the script.

lr=2e-4
lora_rank=64
lora_alpha=128
lora_trainable="q_proj,v_proj,k_proj,o_proj,gate_proj,down_proj,up_proj"
modules_to_save="embed_tokens,lm_head"
lora_dropout=0.05

per_device_train_batch_size=1
gradient_accumulation_steps=1
block_size=32

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 torchrun --nnodes 1 --nproc_per_node 6 --master_port 5896 run_clm_pt_with_peft.py \
    --deepspeed ${deepspeed_config_file} \
    --model_name_or_path ${pretrained_model} \
    --tokenizer_name_or_path ${pretrained_model} \
    --dataset_dir ${dataset_dir} \
    --data_cache_dir ${data_cache} \
    --validation_split_percentage 0.001 \
    --per_device_train_batch_size ${per_device_train_batch_size} \
    --do_train \
    --seed $RANDOM \
    --num_train_epochs 1 \
    --lr_scheduler_type cosine \
    --learning_rate ${lr} \
    --warmup_ratio 0.05 \
    --weight_decay 0.01 \
    --logging_strategy steps \
    --logging_steps 10 \
    --save_strategy steps \
    --save_total_limit 2 \
    --save_steps 200 \
    --gradient_accumulation_steps ${gradient_accumulation_steps} \
    --preprocessing_num_workers 16 \
    --block_size ${block_size} \
    --output_dir ${output_dir} \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --lora_rank ${lora_rank} \
    --lora_alpha ${lora_alpha} \
    --trainable ${lora_trainable} \
    --lora_dropout ${lora_dropout} \
    --modules_to_save ${modules_to_save} \
    --torch_dtype float32 \
    --load_in_kbits 8 \
    --save_safetensors False \
    --gradient_checkpointing \
    --ddp_find_unused_parameters False \
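A configuration detail worth double-checking (a hedged note, assuming the merged tokenizer lives in its own directory rather than inside the base model folder): the command above passes ${pretrained_model} to both --model_name_or_path and --tokenizer_name_or_path, so the extended tokenizer is only used if it actually sits at that path.

# Sketch, assuming the merged tokenizer is stored separately from the base model weights;
# both paths are placeholders.
pretrained_model=/path/to/chinese-llama-2-7b
merged_tokenizer=/path/to/merged_tokenizer
# Then pass --tokenizer_name_or_path ${merged_tokenizer} (instead of ${pretrained_model})
# to run_clm_pt_with_peft.py so training uses the extended vocabulary.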

Dependencies (must be provided for code-related issues)

accelerate==0.27.2 aiofiles==23.2.1 aiohttp==3.9.3 aiosignal==1.3.1 altair==5.2.0 anyio==4.3.0 appdirs==1.4.4 async-timeout==4.0.3 attrs==23.2.0 bitsandbytes==0.41.1 certifi==2024.2.2 charset-normalizer==3.3.2 click==8.1.7 contourpy==1.2.0 cycler==0.12.1 datasets==2.14.5 deepspeed==0.11.0 dill==0.3.7 docker-pycreds==0.4.0 exceptiongroup==1.2.0 fastapi==0.109.2 ffmpy==0.3.2 filelock==3.13.1 fire==0.5.0 fonttools==4.49.0 frozenlist==1.4.1 fsspec==2023.6.0 gitdb==4.0.11 GitPython==3.1.42 gradio==3.50.2 gradio_client==0.6.1 h11==0.14.0 hjson==3.1.0 httpcore==1.0.3 httpx==0.26.0 huggingface-hub==0.17.3 idna==3.6 importlib-resources==6.1.1 Jinja2==3.1.2 joblib==1.3.2 jsonschema==4.21.1 jsonschema-specifications==2023.12.1 kiwisolver==1.4.5 MarkupSafe==2.1.3 matplotlib==3.8.3 mpmath==1.3.0 multidict==6.0.5 multiprocess==0.70.15 networkx==3.2.1 ninja==1.11.1.1 numpy==1.26.4 nvidia-cublas-cu11==11.11.3.6 nvidia-cuda-cupti-cu11==11.8.87 nvidia-cuda-nvrtc-cu11==11.8.89 nvidia-cuda-runtime-cu11==11.8.89 nvidia-cudnn-cu11==8.7.0.84 nvidia-cufft-cu11==10.9.0.58 nvidia-curand-cu11==10.3.0.86 nvidia-cusolver-cu11==11.4.1.48 nvidia-cusparse-cu11==11.7.5.86 nvidia-nccl-cu11==2.19.3 nvidia-nvtx-cu11==11.8.86 orjson==3.9.14 packaging==23.2 pandas==2.2.0 pathtools==0.1.2 peft==0.3.0 pillow==10.2.0 protobuf==4.25.3 psutil==5.9.8 py-cpuinfo==9.0.0 pyarrow==15.0.0 pydantic==1.10.14 pydub==0.25.1 pyparsing==3.1.1 python-dateutil==2.8.2 python-multipart==0.0.9 pytz==2024.1 PyYAML==6.0.1 referencing==0.33.0 regex==2023.12.25 requests==2.31.0 rpds-py==0.18.0 safetensors==0.4.2 scikit-learn==1.4.1.post1 scipy==1.11.1 semantic-version==2.10.0 sentencepiece==0.1.99 sentry-sdk==1.40.5 setproctitle==1.3.3 six==1.16.0 smmap==5.0.1 sniffio==1.3.0 starlette==0.36.3 sympy==1.12 termcolor==2.4.0 threadpoolctl==3.3.0 tokenizers==0.14.1 toolz==0.12.1 torch==2.2.0+cu118 torchaudio==2.2.0+cu118 torchvision==0.17.0+cu118 tqdm==4.66.2 transformers==4.34.0 triton==2.2.0 typing_extensions==4.9.0 tzdata==2024.1 urllib3==2.2.1 uvicorn==0.27.1 wandb==0.15.12 websockets==11.0.3 xxhash==3.4.1 yarl==1.9.4

Execution logs or screenshots

The model's output continuously repeats a single word and is completely meaningless. Do you know where the problem might be coming from?

iMountTai commented 3 months ago

This may be a case of underfitting.

github-actions[bot] commented 3 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your consideration.

adam-mhd94 commented 3 months ago

> This may be a case of underfitting.

Thank you. Due to the 16GB memory per GPU, I cannot increase the batch size. Could the issue be caused by a very small batch size?
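For reference, one way to raise the effective batch size without using more GPU memory is gradient accumulation; the following is only a sketch of adjusted hyperparameters, not a tested configuration:

# Sketch: with 6 GPUs, effective batch = per_device_batch_size * grad_accum_steps * num_gpus.
per_device_train_batch_size=1
gradient_accumulation_steps=16   # effective batch of 1 * 16 * 6 = 96 sequences per update
block_size=512                   # blocks larger than 32 are commonly used for causal LM pre-training

The extra accumulation steps cost training time rather than per-GPU memory, since gradients are summed across micro-batches before each optimizer update.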

github-actions[bot] commented 2 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your consideration.

github-actions[bot] commented 2 months ago

Closing the issue, since no updates observed. Feel free to re-open if you need any further assistance.