ymcui / Chinese-LLaMA-Alpaca

Chinese LLaMA & Alpaca large language models + local CPU/GPU training and deployment (Chinese LLaMA & Alpaca LLMs)
https://github.com/ymcui/Chinese-LLaMA-Alpaca/wiki
Apache License 2.0

How much data was used for incremental pre-training? #806

Closed: peiyingxin closed this issue 1 year ago

peiyingxin commented 1 year ago

The following items must be checked before submission

Issue type

Model training and fine-tuning

Base model

LLaMA-7B

Operating system

Linux

Detailed description of the problem

How many billion tokens did the pre-training data contain? The project mentions using 120GB of Chinese corpus, which should correspond to roughly 30~40B tokens. However, the token count computed from the pre-training configuration differs substantially from that: total tokens = 1024 × 512 × 6000 ≈ 3.1B.
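(For reference, both estimates can be reproduced with a quick back-of-the-envelope calculation. The sketch below is illustrative only: the 3-bytes-per-character and 1-token-per-character ratios, as well as the block size, batch size, and step count, are assumptions read off this question, not values confirmed by the project.)

```python
# Back-of-the-envelope check of the two token estimates in the question.
# All constants here are assumptions, not the project's published config.

# Estimate 1: tokens in a 120GB Chinese corpus.
corpus_bytes = 120e9        # "120G Chinese corpus"
bytes_per_char = 3          # Chinese characters are ~3 bytes in UTF-8
tokens_per_char = 1.0       # assume ~1 tokenizer token per character
corpus_tokens = corpus_bytes / bytes_per_char * tokens_per_char
print(f"corpus estimate: {corpus_tokens / 1e9:.1f}B tokens")   # -> 40.0B

# Estimate 2: tokens actually consumed during training,
# i.e. sequence length * effective batch size * training steps.
block_size = 1024           # tokens per sequence (assumed)
global_batch = 512          # effective global batch size (assumed)
steps = 6000                # optimizer steps (assumed)
trained_tokens = block_size * global_batch * steps
print(f"config estimate: {trained_tokens / 1e9:.2f}B tokens")  # -> 3.15B
```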

Dependencies (required for code-related issues)

# Paste dependency information here

Runtime logs or screenshots

# Paste runtime logs here

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your consideration.

github-actions[bot] commented 1 year ago

Closing the issue since no updates have been observed. Feel free to re-open if you need any further assistance.