ymcui / Chinese-BERT-wwm

Pre-Training with Whole Word Masking for Chinese BERT (Chinese BERT-wwm series models)
https://ieeexplore.ieee.org/document/9599397
Apache License 2.0

Roughly how many epochs did you train? #28

Closed. hzrpku closed this issue 5 years ago.

hzrpku commented 5 years ago

Also, is your dupe_factor parameter left at the default of 10? Thanks!

ymcui commented 5 years ago
  1. Do you mean BERT-wwm-ext? The first pre-training stage (maximum sequence length 128) used a batch size of 2,560 and ran for 1M steps. The second stage (maximum sequence length 512) used a batch size of 384 and ran for 400K steps.
  2. dupe_factor=5 (see the sketch after this reply for what this parameter controls).
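
For context, dupe_factor comes from Google's create_pretraining_data.py: it controls how many times the corpus is duplicated, each copy with a different random masking. The sketch below only mirrors that script's outer loop; create_instances stands in for the real instance-building and masking logic and is not part of the original code.

```python
import random

def generate_training_instances(documents, dupe_factor, create_instances, rng=None):
    """Illustrative sketch of the structure of BERT's create_pretraining_data.py:
    the corpus is walked dupe_factor times, and each pass draws a fresh random
    masking, so every sentence ends up in the tf.record files dupe_factor times
    with different [MASK] positions."""
    rng = rng or random.Random(12345)
    instances = []
    for _ in range(dupe_factor):            # dupe_factor=5 here instead of the default 10
        for document in documents:
            instances.extend(create_instances(document, rng))  # masking happens inside
    rng.shuffle(instances)
    return instances
```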
hzrpku commented 5 years ago

What about the number of training epochs and the dupe_factor for the first model, BERT-wwm?

ymcui commented 5 years ago

It is written in the technical report.

We train 100K steps on the samples with a maximum length of 128, batch size of 2,560, an initial learning rate of 1e-4 (with warm-up ratio 10%). Then, we train another 100K steps on a maximum length of 512 with a batch size of 384 to learn the long-range dependencies and position embeddings. 

dupe_factor is also 5.
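
Since the issue title asks about epochs while the report quotes steps, here is a rough way to convert one to the other. The instance count below is a made-up placeholder, not a number from the report or the repository:

```python
# Rough steps-to-epochs conversion for the 128-length stage of BERT-wwm.
# NUM_INSTANCES is a hypothetical placeholder; the real count depends on the
# corpus (Chinese Wikipedia) and on dupe_factor=5.
num_train_steps = 100_000      # from the technical report (128-length stage)
train_batch_size = 2_560       # from the technical report
NUM_INSTANCES = 50_000_000     # placeholder: instances in the generated tf.record files

examples_seen = num_train_steps * train_batch_size
passes = examples_seen / NUM_INSTANCES
print(f"~{passes:.1f} passes over the generated training instances")
```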

hzrpku commented 5 years ago

Thanks for the answer!

  1. How large was the input data, i.e. the raw text rather than the tf.record files?
  2. If the batch size cannot be made that large (2,560), do you have any good suggestions?

ymcui commented 5 years ago
  1. The model without "ext" was trained on Chinese Wikipedia only; for the "ext" version the raw text is roughly 15~20 GB. I did not keep exact statistics.
  2. If you cannot use a large batch size, consider gradient accumulation (a short sketch follows this reply). A batch size that is too small does indeed hurt performance (this has been confirmed in the BERT/XLNet GitHub repos).
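
A minimal gradient-accumulation sketch, written as a PyTorch-style loop for brevity (the released models were trained with TensorFlow). The model(**batch).loss call assumes a HuggingFace-style interface; model, data_loader, and accum_steps are placeholders:

```python
def train_epoch(model, data_loader, optimizer, accum_steps=8):
    """Gradient accumulation: sum gradients from accum_steps small mini-batches
    before one optimizer update, so the effective batch size is
    accum_steps * per-step batch size."""
    model.train()
    optimizer.zero_grad()
    for step, batch in enumerate(data_loader):
        loss = model(**batch).loss          # assumes a HuggingFace-style model output
        (loss / accum_steps).backward()     # scale so the summed gradient is an average
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```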
hzrpku commented 5 years ago

Thank you so much!