ymcui / Chinese-LLaMA-Alpaca-2

Chinese LLaMA-2 & Alpaca-2 large-model project, phase 2, with 64K long-context models (Chinese LLaMA-2 & Alpaca-2 LLMs with 64K long context models)
Apache License 2.0

Hangs while loading the dataset, what is the cause? #365

Closed clclclaiggg closed 12 months ago

clclclaiggg commented 1 year ago

Pre-submission checklist (all items confirmed)

Issue type

Model training and fine-tuning

Base model

Chinese-LLaMA-2 (7B/13B)

Operating system

Linux

Detailed description of the problem

[INFO|tokenization_utils_base.py:1837] 2023-10-24 14:28:08,190 >> loading file tokenizer.model
[INFO|tokenization_utils_base.py:1837] 2023-10-24 14:28:08,190 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:1837] 2023-10-24 14:28:08,190 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:1837] 2023-10-24 14:28:08,190 >> loading file tokenizer_config.json
Using custom data configuration default-95ec87dea5b633cd
10/24/2023 14:28:09 - INFO - datasets.builder - Using custom data configuration default-95ec87dea5b633cd
Loading Dataset Infos from /data/chenlong/enter/envs/chenllama/lib/python3.10/site-packages/datasets/packaged_modules/text
10/24/2023 14:28:09 - INFO - datasets.info - Loading Dataset Infos from /data/chenlong/enter/envs/chenllama/lib/python3.10/site-packages/datasets/packaged_modules/text

It hangs at this point with no further output. What could be the cause?

Dependencies (must be provided for code-related issues)

No response

Run logs or screenshots

No response
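The log stops right after datasets begins loading its Dataset Infos, before any cache files are written. A quick way to narrow this down is to run the text loader by itself, outside torchrun and DeepSpeed. This is a minimal sketch, not the project's code; the data path is a placeholder for the files under --dataset_dir:

import datasets
from datasets import load_dataset

# Turn on debug logging so datasets reports each step before the hang.
datasets.logging.set_verbosity_debug()

# Placeholder: point at one of the plain-text files passed via --dataset_dir.
data_files = {"train": ["/path/to/pt/data/dir/sample.txt"]}

# If this call alone never returns, the hang is inside the datasets
# library (or its cache/locking), not in the training script itself.
raw = load_dataset("text", data_files=data_files)
print(raw)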

iMountTai commented 1 year ago

Try rerunning it.

clclclaiggg commented 1 year ago

Try rerunning it.

Still no luck; the SFT script hangs in the same way.

iMountTai commented 1 year ago

How many GPUs, how large is the dataset, and is the cache being generated normally? Also, please paste your script command. Using a fast tokenizer can also trigger this problem.
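On the fast-tokenizer point, the quickest check is to force the slow SentencePiece implementation when loading the tokenizer. A minimal sketch assuming a stock transformers install; the path is a placeholder:

from transformers import AutoTokenizer

# use_fast=False selects the slow SentencePiece LlamaTokenizer instead of
# the Rust-backed LlamaTokenizerFast.
tokenizer = AutoTokenizer.from_pretrained(
    "/path/to/chinese-llama-2/tokenizer/dir",  # placeholder path
    use_fast=False,
)
print(type(tokenizer).__name__)  # expect "LlamaTokenizer"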

clclclaiggg commented 1 year ago

I'm using a test dataset of only 25k. The command is:

# Read the wiki (https://github.com/ymcui/Chinese-LLaMA-Alpaca-2/wiki/pt_scripts_zh) carefully before running the script
lr=2e-4
lora_rank=64
lora_alpha=128
lora_trainable="q_proj,v_proj,k_proj,o_proj,gate_proj,down_proj,up_proj"
modules_to_save="embed_tokens,lm_head"
lora_dropout=0.05

pretrained_model=path/to/hf/llama-2/dir
chinese_tokenizer_path=path/to/chinese-llama-2/tokenizer/dir
dataset_dir=path/to/pt/data/dir
data_cache=temp_data_cache_dir
per_device_train_batch_size=1
gradient_accumulation_steps=8
block_size=256
output_dir=output_dir

deepspeed_config_file=ds_zero2_no_offload.json

torchrun --nnodes 1 --nproc_per_node 1 run_clm_pt_with_peft.py \
    --deepspeed ${deepspeed_config_file} \
    --model_name_or_path /data/chenlong/LLaMA-Efficient-Tuning-main/models/7B-chat/ \
    --tokenizer_name_or_path /data/chenlong/LLaMA-Efficient-Tuning-main/models/7B-chat/ \
    --dataset_dir /data/chenlong/Chinese-LLaMA-Alpaca-2-main/data1/ \
    --data_cache_dir /data/chenlong/Chinese-LLaMA-Alpaca-2-main/output/ \
    --validation_split_percentage 0.001 \
    --per_device_train_batch_size ${per_device_train_batch_size} \
    --do_train \
    --seed $RANDOM \
    --fp16 \
    --num_train_epochs 1 \
    --lr_scheduler_type cosine \
    --learning_rate ${lr} \
    --warmup_ratio 0.05 \
    --weight_decay 0.01 \
    --logging_strategy steps \
    --logging_steps 10 \
    --save_strategy steps \
    --save_total_limit 3 \
    --save_steps 200 \
    --gradient_accumulation_steps ${gradient_accumulation_steps} \
    --preprocessing_num_workers 8 \
    --block_size ${block_size} \
    --output_dir /data/chenlong/Chinese-LLaMA-Alpaca-2-main/output1/ \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --lora_rank ${lora_rank} \
    --lora_alpha ${lora_alpha} \
    --trainable ${lora_trainable} \
    --lora_dropout ${lora_dropout} \
    --modules_to_save ${modules_to_save} \
    --torch_dtype float16 \
    --load_in_kbits 16 \
    --gradient_checkpointing \
    --ddp_find_unused_parameters False
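Note that this command sets --preprocessing_num_workers 8. One well-known way for Hugging Face preprocessing to deadlock is forking worker processes after the Rust tokenizers thread pool has started; whether that is the cause here is only an assumption, but it is cheap to rule out. The sketch below mirrors the script's tokenization step with a single worker; paths are placeholders and the map function is simplified relative to run_clm_pt_with_peft.py:

import os

# Must be set before tokenizers is imported; forked dataset workers can
# deadlock if the parent's tokenizer thread pool is already running.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/path/to/tokenizer/dir", use_fast=False)  # placeholder
raw = load_dataset("text", data_files={"train": ["/path/to/pt/data/dir/sample.txt"]})  # placeholder

# Single-worker first: if num_proc=1 finishes but num_proc=8 hangs, the
# problem is in multiprocess preprocessing rather than the data itself.
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"]),
    batched=True,
    num_proc=1,
    remove_columns=["text"],
)
print(tokenized)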

clclclaiggg commented 1 year ago

data_cache_dir

How many GPUs, how large is the dataset, and is the cache being generated normally? Also, please paste your script command. Using a fast tokenizer can also trigger this problem.

The files generated under data_cache_dir are empty.

iMountTai commented 1 year ago

Setting data_cache_dir has no effect; the cache is generated directly in the folder containing the data. I could not reproduce your problem on my side; please keep debugging.
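Given that the cache lands next to the data, two things are worth checking: whether any .arrow cache files ever appear there, and whether a stale .lock file from an earlier, interrupted run is blocking the build, since datasets serializes builds with file locks. A hedged sketch; the data path is a placeholder, and the cache root assumes the default (no HF_DATASETS_CACHE override):

import glob
import os

# Placeholder: the directory passed as --dataset_dir, where the cache is written.
data_dir = "/path/to/pt/data/dir"

# Partial or zero-size .arrow files suggest the build started and then stalled.
for f in glob.glob(os.path.join(data_dir, "**", "*.arrow"), recursive=True):
    print(f, os.path.getsize(f))

# Leftover .lock files from a killed run (here and in the default HF cache)
# can make a later load_dataset block while waiting to acquire the lock.
cache_root = os.path.expanduser("~/.cache/huggingface/datasets")
for lock in glob.glob(os.path.join(cache_root, "**", "*.lock"), recursive=True):
    print(lock)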

github-actions[bot] commented 12 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your consideration.

dehaozhou commented 8 months ago

Setting data_cache_dir has no effect; the cache is generated directly in the folder containing the data. I could not reproduce your problem on my side; please keep debugging.

Hello, did you ever solve this problem? I'm running into the same issue.

clclclaiggg commented 8 months ago

Setting data_cache_dir has no effect; the cache is generated directly in the folder containing the data. I could not reproduce your problem on my side; please keep debugging.

Hello, did you ever solve this problem? I'm running into the same issue.

No, I never solved it; I switched to a different project.

dehaozhou commented 8 months ago

OK, thank you for your reply.
