Closed seyoseyoseyo closed 3 years ago
Hi! I would also be interested in that. It would be great if you could provide the scripts you used to download the pretraining corpus.
We don't own those datasets, so we cannot provide them. However, you can download most of them from public sources. For example, corporate 10-K/10-Q reports can be downloaded from the SEC website: https://www.sec.gov/edgar/searchedgar/companysearch.html
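For anyone who wants to script the 10-K/10-Q download rather than use the search page, here is a minimal sketch. It is not the authors' pipeline. It assumes EDGAR's public JSON endpoint at `data.sec.gov/submissions/` and the response layout as I understand it (`filings.recent` holding parallel `form`, `accessionNumber`, and `primaryDocument` arrays), so please check both against the current SEC documentation. The function names and the `USER_AGENT` value are placeholders of mine.

```python
import json
import urllib.request

# Assumption: the SEC asks automated clients to identify themselves.
# Replace this placeholder with your own name and contact address.
USER_AGENT = "Your Name your.email@example.com"


def fetch_submissions(cik: int) -> dict:
    """Fetch the filing index for one company (requires network access)."""
    # Assumed endpoint: the CIK is zero-padded to 10 digits.
    url = f"https://data.sec.gov/submissions/CIK{cik:010d}.json"
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def filing_urls(submissions: dict, forms=("10-K", "10-Q")) -> list[str]:
    """Build archive URLs for the primary document of each matching filing."""
    cik = int(submissions["cik"])  # drops any zero padding
    recent = submissions["filings"]["recent"]
    urls = []
    for form, accession, doc in zip(
        recent["form"], recent["accessionNumber"], recent["primaryDocument"]
    ):
        if form in forms:
            # The archive path uses the accession number without dashes.
            folder = accession.replace("-", "")
            urls.append(
                f"https://www.sec.gov/Archives/edgar/data/{cik}/{folder}/{doc}"
            )
    return urls
```

Typical use would be `filing_urls(fetch_submissions(cik))` with the company's numeric CIK, then fetching each URL with the same `User-Agent` header and a pause between requests to stay within the SEC's fair-access limits.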
Hi @yya518, would it be possible to release the processing scripts you used to shard/tokenize the dataset? I can download the various datasets individually myself, but it would be great if I could get your methodology for processing the data (via scripts). Thank you so much!
I'm interested in whether you could provide the three corpora used for pretraining FinBERT. Thanks in advance :)