Closed michael-wzhu closed 5 years ago
Some corpora are protected by copyright, and the project owner has no right to release them. Public corpora, on the other hand, are easy to obtain: you can search GitHub for the keywords 'Chinese corpus', or gather the data yourself.
Thank you for your clarification. @yaleimeng
I agree that freely accessible large-scale training data is important for future NLP research. However, licensing issues are unavoidable in practice. One thing you may have noticed: you CANNOT find ready-to-download large-scale Baike data, but you will find plenty of spider programs. Given that, I'm afraid you will have to use these spider programs to crawl the data yourself. Sorry for the inconvenience this has caused.
As we all know, Chinese NLP research has been slowed by the unavailability of large open-source corpora, and the problem has become more severe with the recent advances in large pre-trained LMs. Could you make the training corpus open-source for further research and follow-up work?