yya518 / FinBERT

A Pretrained BERT Model for Financial Communications. https://arxiv.org/abs/2006.08097
Apache License 2.0

Do you provide the complete corpus used for the pre-training? #20

Closed: seyoseyoseyo closed this issue 3 years ago

seyoseyoseyo commented 3 years ago

I am interested in whether you would provide the three corpora used for pretraining FinBERT. Thanks in advance :)

Urmish commented 3 years ago

Hi! I would also be interested in that. It would be great if you could provide the scripts you used to download the pretraining corpus.

yya518 commented 3 years ago

We don't own those datasets, so we cannot provide them. However, you can download most of them from public sources. For example, corporate 10-K/10-Q reports can be downloaded from the SEC EDGAR website: https://www.sec.gov/edgar/searchedgar/companysearch.html
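For anyone starting from scratch: the sketch below is not the authors' script, just one common way to enumerate 10-K/10-Q filings from the public EDGAR quarterly full index. The year/quarter values and the contact email in the User-Agent header are placeholder assumptions (EDGAR requires a descriptive User-Agent).

```python
# Minimal sketch (not the FinBERT authors' pipeline): list 10-K/10-Q filings
# for one quarter from the public SEC EDGAR full index.
import requests

EDGAR_INDEX = "https://www.sec.gov/Archives/edgar/full-index/{year}/QTR{qtr}/form.idx"
HEADERS = {"User-Agent": "research-use your-email@example.com"}  # placeholder contact, required by EDGAR

def list_filings(year, qtr, form_types=("10-K", "10-Q")):
    """Return (form_type, company, date_filed, url) tuples for the given quarter."""
    text = requests.get(EDGAR_INDEX.format(year=year, qtr=qtr), headers=HEADERS).text
    filings = []
    for line in text.splitlines():
        parts = line.split()
        # form.idx rows look like: FORM  COMPANY NAME  CIK  DATE  FILE PATH
        if not parts or parts[0] not in form_types:
            continue
        path, date, company = parts[-1], parts[-2], " ".join(parts[1:-3])
        filings.append((parts[0], company, date, "https://www.sec.gov/Archives/" + path))
    return filings

if __name__ == "__main__":
    # Example: print the first few 10-K/10-Q filings from Q1 2019 (illustrative choice).
    for form, company, date, url in list_filings(2019, 1)[:5]:
        print(form, company, date, url)
```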

VenkatKS commented 3 years ago

Hi @yya518, would it be possible to release the processing scripts you used to shard/tokenize the dataset? I can download the various datasets individually myself, but it would be great if I could get your methodology for processing the data (via scripts). Thank you so much!
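In the meantime, the sketch below shows one generic way to shard a plain-text corpus and tokenize it for BERT-style pretraining. It is not the authors' methodology; the one-document-per-line format, shard size, file naming, and the stock bert-base-uncased tokenizer (FinBERT trained its own vocabulary) are all assumptions for illustration.

```python
# Generic sharding/tokenization sketch, not the FinBERT authors' pipeline.
from pathlib import Path
from transformers import BertTokenizerFast

# Assumption: a stock BERT tokenizer stands in for FinBERT's custom vocabulary.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

def shard_corpus(corpus_path, out_dir, docs_per_shard=10_000):
    """Split a one-document-per-line corpus file into smaller shard files."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    shard, shard_id = [], 0
    with open(corpus_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            shard.append(line)
            if len(shard) == docs_per_shard:
                (out_dir / f"shard_{shard_id:05d}.txt").write_text("\n".join(shard), encoding="utf-8")
                shard, shard_id = [], shard_id + 1
    if shard:  # flush the final partial shard
        (out_dir / f"shard_{shard_id:05d}.txt").write_text("\n".join(shard), encoding="utf-8")

def tokenize_shard(shard_path, max_length=512):
    """Tokenize one shard into truncated input_ids suitable for masked-LM pretraining."""
    lines = Path(shard_path).read_text(encoding="utf-8").splitlines()
    return tokenizer(lines, truncation=True, max_length=max_length)["input_ids"]
```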