Open yineza7 opened 8 months ago
Attached to this comment is a list of datasets that I have found so far. The majority of them are news articles, but there are also outliers such as Reddit comments and reviews from retail stores. Also, I will attach the files across 2 different comments, since the datasets so far are too large for a single attachment.
Attachments: archive-1.zip, reddit-comments.zip, reviews.zip, walmart reviews.zip, walmart_reviews_2.zip
@Obasjoe Could you add the link to the location of the data for easier direct access?
https://www.kaggle.com/datasets?search=documents Kaggle, a Google subsidiary, is a virtual gathering place for data scientists and machine learning practitioners. It provides a platform where users can discover datasets for AI model development, share datasets, collaborate with other data enthusiasts, and participate in contests to address data science problems. Since its inception in 2010, Kaggle has been hosting machine learning and data science competitions, and it also provides a public data platform and cloud-based workspace for data science and AI learning.
https://paperswithcode.com/datasets?task=text-summarization&page=1 Papers with Code is a community-driven platform that provides a comprehensive resource for Machine Learning research, including papers, code, datasets, and evaluation methods. The platform encourages open collaboration, facilitated by NLP and ML technologies. All content is freely available under the CC-BY-SA license, similar to Wikipedia, and contributions from users are welcomed. In addition to its main focus, the platform also hosts specialized sections for fields like astronomy, physics, computer sciences, mathematics, and statistics.
https://www.kaggle.com/datasets/jpmiller/layoutlm Medical dataset; the description says it can be used for text summarization. The download is about 30GB.
https://paperswithcode.com/dataset/massivetext Another potential dataset. It looks like it could be terabytes of data, so be careful when downloading.
https://www.kaggle.com/datasets/thedevastator/pubmed-article-summarization-dataset Not sure whether the previous two are summarization datasets, but this one should be. It is based on medical data.
Research articles: https://paperswithcode.com/dataset/arxiv-summarization-dataset
Sticking to a particular genre (e.g., news articles)
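Since the collected files mix news articles with Reddit comments and store reviews, narrowing to one genre could look roughly like the sketch below. The record layout (a dict with `genre` and `text` keys), the helper names, and the sample rows are all hypothetical; none of them come from the attached datasets, which would first need to be labeled by source.

```python
from collections import Counter
from typing import Iterable

# Hypothetical record layout: each document is a dict with a "genre" label
# and its raw "text". These field names are assumptions for illustration.
Record = dict[str, str]


def filter_by_genre(records: Iterable[Record], genre: str) -> list[Record]:
    """Keep only records whose genre matches, ignoring case and whitespace."""
    wanted = genre.strip().lower()
    return [r for r in records if r.get("genre", "").strip().lower() == wanted]


def genre_counts(records: Iterable[Record]) -> Counter:
    """Count documents per genre to see how mixed the collection is."""
    return Counter(r.get("genre", "unknown").strip().lower() for r in records)


# Made-up sample rows standing in for the merged datasets.
sample = [
    {"genre": "news", "text": "City council approves new budget."},
    {"genre": "reddit", "text": "Has anyone tried this library?"},
    {"genre": "News ", "text": "Local team wins the final."},
    {"genre": "review", "text": "Arrived late but works fine."},
]

news_only = filter_by_genre(sample, "news")
print(genre_counts(sample))
print(len(news_only), "news documents kept")
```

Checking the per-genre counts first gives a quick sense of whether a single genre has enough documents to be worth keeping on its own.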