yaoxingcheng / TLM

ICML'2022: NLP From Scratch Without Large-Scale Pretraining: A Simple and Efficient Framework
MIT License
257 stars, 21 forks

How to download the CC-Stories corpus #11

Closed: sunyilgdx closed this issue 10 months ago

sunyilgdx commented 2 years ago

How can I download the CC-Stories corpus?

yaoxingcheng commented 2 years ago

We haven't found a publicly available version of STORIES yet. In our work, we follow the methodology described in the original STORIES dataset paper (Section 5.3) to collect a set of documents of similar size from the CommonCrawl corpus, and we use those collected documents as part of our general corpus.
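
The reply does not spell out the selection procedure, so the following is only a rough sketch for readers who want to rebuild a similar corpus. It assumes the STORIES approach amounts to ranking CommonCrawl documents by how many n-grams they share with a set of task questions and keeping the top-scoring ones. The function names (`ngrams`, `overlap_score`, `select_documents`), the whitespace tokenizer, and the defaults `n=2` and `top_k=2` are all hypothetical and are not taken from the TLM repository or the STORIES paper.

```python
from typing import Iterable, List, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase whitespace-token n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_score(doc: str, query_ngrams: Set[Tuple[str, ...]], n: int) -> int:
    """Count the distinct n-grams that `doc` shares with the query set."""
    return len(ngrams(doc, n) & query_ngrams)


def select_documents(
    docs: Iterable[str],
    queries: Iterable[str],
    n: int = 2,        # assumed n-gram order, for illustration only
    top_k: int = 2,    # assumed cutoff; a real corpus would keep far more
) -> List[Tuple[int, str]]:
    """Rank documents by n-gram overlap with the queries and keep the top_k.

    Documents with no overlap at all are dropped before ranking.
    """
    query_ngrams: Set[Tuple[str, ...]] = set()
    for query in queries:
        query_ngrams |= ngrams(query, n)

    scored = [(overlap_score(doc, query_ngrams, n), doc) for doc in docs]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

In practice, the documents would be streamed from CommonCrawl text extracts rather than held in a list, and the cutoff would be chosen so that the collected text roughly matches the size of the original STORIES corpus.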