Will the corpus for training be open-sourced?

ymcui / Chinese-XLNet

Pre-Trained Chinese XLNet（中文XLNet预训练模型）

http://xlnet.hfl-rc.com

Apache License 2.0


Will the corpus for training be open-sourced? #6

michael-wzhu closed this issue 5 years ago

michael-wzhu commented 5 years ago

As we all know, Chinese NLP research has been slowed down by the unavailability of large open-source corpora, and this problem has become more severe with the recent advances in large pre-trained LMs. Could you make the training corpus open-source, for further research or follow-up work?

yaleimeng commented 5 years ago

Some corpora are protected by copyright, and the project owner has no right to release them. Public corpora are actually easy to obtain: you can search for the keywords 'Chinese corpus' on GitHub, or gather the data yourself.

ymcui commented 5 years ago

Thank you for the clarification, @yaleimeng.

I agree that freely accessible large-scale training data is important for future NLP research. However, licensing issues are unavoidable in practice. One thing you may have noticed: you CANNOT find ready-to-download large-scale Baike data, but you will find plenty of spider programs. Given this, I'm afraid you will have to use those spider programs to crawl the data yourself. Sorry for the inconvenience.