Training BERT on unlabeled Chinese data

SakuraXiaMF commented 4 years ago

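I'd like to do unsupervised training on some unlabeled Chinese data. Is there a tutorial for this? I read the official documentation, but I didn't find it very clear, so I'd appreciate some guidance.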

ymcui commented 4 years ago

I don't quite follow. Could you explain in more detail? Do you mean you want to use free (unannotated) text for pretraining?

SakuraXiaMF commented 4 years ago

Thanks for your reply. Yes, I have a batch of unlabeled Chinese text in TSV format that I'd like to use for pretraining. Is there a tutorial for this? The official example I found is:

    export TRAIN_FILE=/path/to/dataset/wiki.train.raw
    export TEST_FILE=/path/to/dataset/wiki.test.raw

    python run_language_modeling.py \
        --output_dir=output \
        --model_type=roberta \
        --model_name_or_path=roberta-base \
        --do_train \
        --train_data_file=$TRAIN_FILE \
        --do_eval \
        --eval_data_file=$TEST_FILE \
        --mlm

SakuraXiaMF commented 4 years ago

I don't quite understand how to use this for Chinese. My Chinese model is the official bert-base-chinese model. How should I set these parameters?

ymcui commented 4 years ago

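The example you listed looks like the run_language_modeling.py script provided by Huggingface: https://huggingface.co/transformers/examples.html#roberta-bert-and-masked-language-modeling
Full API: https://github.com/huggingface/transformers/blob/master/examples/run_language_modeling.py#L483

Alternatively, you can use the official RoBERTa script and tutorial: https://github.com/pytorch/fairseq/blob/master/examples/roberta/README.pretraining.md

For pretraining, the main parameters to set are the learning rate, the batch size, and the text (sequence) length. The rest can stay at their defaults for now. Because the models in this repository were not trained with the Huggingface toolkit, I can't provide a tutorial at the moment. You can look for related tutorials in the Huggingface Transformers community.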
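As a rough illustration of the three settings mentioned above, the sketch below assembles a run_language_modeling.py command line for bert-base-chinese. The flag names (`--learning_rate`, `--per_gpu_train_batch_size`, `--block_size`, `--line_by_line`) are taken from the early-2020 version of the script as far as I recall, so check `python run_language_modeling.py --help` for your version. The file names and numeric values are placeholders, not recommended settings.

```python
import shlex


def build_pretraining_command(train_file, eval_file, output_dir="output",
                              learning_rate=5e-5, batch_size=16, block_size=128):
    """Return the argument list for a masked-LM run starting from bert-base-chinese.

    The default values are illustrative placeholders only.
    """
    return [
        "python", "run_language_modeling.py",
        f"--output_dir={output_dir}",
        "--model_type=bert",                        # BERT instead of RoBERTa
        "--model_name_or_path=bert-base-chinese",   # start from the Chinese checkpoint
        "--do_train",
        f"--train_data_file={train_file}",
        "--do_eval",
        f"--eval_data_file={eval_file}",
        "--mlm",                                    # masked language modeling objective
        "--line_by_line",                           # treat each line as one example
        f"--learning_rate={learning_rate}",
        f"--per_gpu_train_batch_size={batch_size}",
        f"--block_size={block_size}",               # maximum sequence length in tokens
    ]


# Hypothetical file names; replace them with your own data files.
command = build_pretraining_command("train.txt", "eval.txt")
print(" ".join(shlex.quote(arg) for arg in command))
```

The printed line can be pasted into a shell, or the list can be passed to `subprocess.run(command, check=True)` from the directory that contains the script.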

SakuraXiaMF commented 4 years ago

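Haha, thanks. So my approach is fine; it was just my parameter settings that were off. By the way, can the input file be a .tsv file? Their script example uses .raw files.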
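On the .tsv question: as far as I understand, the script reads its input as plain text regardless of the file extension, so a .tsv file would be read with its tabs and any extra columns included. One cautious option is to extract just the text column into a plain text file first. The sketch below does that with the standard `csv` module. The function name, the column index, and the header handling are assumptions about the data layout, not something stated in this thread.

```python
import csv


def tsv_column_to_text(tsv_path, txt_path, column=0, skip_header=False):
    """Copy one column of a TSV file into a text file, one example per line.

    Returns the number of lines written. Empty cells and rows that are
    too short to contain the requested column are skipped.
    """
    written = 0
    with open(tsv_path, encoding="utf-8", newline="") as src, \
            open(txt_path, "w", encoding="utf-8") as dst:
        # QUOTE_NONE keeps quotation marks in the text as literal characters.
        reader = csv.reader(src, delimiter="\t", quoting=csv.QUOTE_NONE)
        if skip_header:
            next(reader, None)
        for row in reader:
            if len(row) <= column:
                continue
            text = row[column].strip()
            if text:
                dst.write(text + "\n")
                written += 1
    return written
```

For example, `tsv_column_to_text("corpus.tsv", "train.txt", column=1, skip_header=True)` would write the second column of a hypothetical corpus.tsv to train.txt, which can then be passed as `--train_data_file`.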

SakuraXiaMF commented 4 years ago

Thank you, training is running now.

    Iteration: 6%|████▋ | 578/9332 [05:30<1:12:23, 2.02
    Iteration: 6%|████▋ | 579/9332 [05:30<1:13:12, 1.99
    Iteration: 6%|████▋ | 580/9332 [05:31<1:14:41, 1.95
    Iteration: 6%|████▋ | 581/9332 [05:31<1:14:35, 1.96
    Iteration: 6%|████▋ | 582/9332 [05:32<1:14:20, 1.96
    Iteration: 6%|████▋ | 583/9332 [05:32<1:14:25, 1.96
    Iteration: 6%|████▊ | 584/9332 [05:33<1:13:17, 1.99
    Iteration: 6%|████▊ | 585/9332 [05:33<1:14:26, 1.96
    Iteration: 6%|████▊ | 586/9332 [05:34<1:14:46, 1.95
    Iteration: 6%|████▊

2212168851 commented 4 years ago

Yes, it can be trained; I got training working a long time ago. The key question is how to improve the evaluation results. I'm now looking at adding evaluation data. (Sent by email in reply to SakuraXiaMF's comment of Thursday, March 19, 2020, 10:40 AM.)


ymcui commented 4 years ago

Feel free to reopen this issue if you have any other questions.

ymcui / Chinese-BERT-wwm

Training BERT on unlabeled Chinese data #95