finding more datasets - GitHub Issues

yineza7 commented 8 months ago

Sticking to a particular genre (i.e., news articles).

Obasjoe commented 8 months ago

Attached to this comment is a list of datasets that I have found so far. The majority of them are news articles, but there are also outliers like Reddit comments and reviews from retail stores. Also, I will attach the files across 2 different comments, since the datasets so far are too large for a single attachment.

Attachments: archive-1.zip, reddit-comments.zip, reviews.zip, walmart reviews.zip, walmart_reviews_2.zip
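
In case it helps anyone check what is inside one of these archives before extracting it, here is a minimal sketch. It assumes the zip files have already been downloaded locally; the helper names `summarize_zip` and `total_size_mb` are made up for illustration and are not part of this repo.

```python
import zipfile


def summarize_zip(path_or_file):
    """Return (filename, uncompressed size in bytes) for each file in a zip.

    Accepts a path such as "archive-1.zip" or an open binary file object.
    Directory entries are skipped.
    """
    with zipfile.ZipFile(path_or_file) as zf:
        return [
            (info.filename, info.file_size)
            for info in zf.infolist()
            if not info.is_dir()
        ]


def total_size_mb(entries):
    """Sum the uncompressed sizes from summarize_zip() and convert to MiB."""
    return sum(size for _, size in entries) / (1024 * 1024)
```

For example, `total_size_mb(summarize_zip("archive-1.zip"))` would give the extracted size of that archive, which is useful for deciding whether it fits on disk.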

Obasjoe commented 8 months ago

BBC-news-summary.zip

yineza7 commented 7 months ago

@Obasjoe Could you add the link to the location of the data for easier direct access?

yineza7 commented 7 months ago

https://www.kaggle.com/datasets?search=documents Kaggle, a Google subsidiary, is an online community for data scientists and machine learning practitioners. Users can discover and share datasets for AI model development, collaborate with other data enthusiasts, and enter contests that address data science problems. Kaggle has hosted machine learning and data science competitions since its inception in 2010, and it also provides a public data platform and a cloud-based workspace for data science and AI learning.
https://paperswithcode.com/datasets?task=text-summarization&page=1 Papers with Code is a community-driven platform that collects Machine Learning research resources, including papers, code, datasets, and evaluation methods. The platform encourages open collaboration, facilitated by NLP and ML technologies. All content is freely available under the CC-BY-SA license, similar to Wikipedia, and user contributions are welcome. Beyond its main focus, the platform also hosts specialized sections for astronomy, physics, computer science, mathematics, and statistics.

ColinThomas1 commented 7 months ago

https://www.kaggle.com/datasets/jpmiller/layoutlm Medical dataset. The description says it can be used for text summarization. It is a 30 GB download.

ColinThomas1 commented 7 months ago

https://paperswithcode.com/dataset/massivetext Another potential dataset. It looks like it could be terabytes of data, so be careful when downloading.

ColinThomas1 commented 7 months ago

https://www.kaggle.com/datasets/thedevastator/pubmed-article-summarization-dataset I'm not sure whether the previous two were summarization datasets, but this one should be. It is based on medical data.
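
Since several of the links in this thread are Kaggle dataset pages, here is a rough sketch for turning such a URL into the `owner/dataset` reference that download tools expect. The function name `kaggle_dataset_ref` is hypothetical, and it assumes the URL has the form `https://www.kaggle.com/datasets/<owner>/<dataset>` as in the links above.

```python
from urllib.parse import urlparse


def kaggle_dataset_ref(url):
    """Extract "owner/dataset" from a Kaggle dataset page URL.

    Assumes the path looks like /datasets/<owner>/<dataset>; anything else
    (for example a search page) raises ValueError.
    """
    # Split the path into its non-empty segments.
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) < 3 or parts[0] != "datasets":
        raise ValueError(f"not a Kaggle dataset page URL: {url}")
    return f"{parts[1]}/{parts[2]}"
```

The resulting reference can then be passed to the Kaggle command-line tool, for example `kaggle datasets download -d jpmiller/layoutlm`, assuming the tool is installed and an API token is configured.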

yineza7 commented 7 months ago

Research articles: https://paperswithcode.com/dataset/arxiv-summarization-dataset

yineza7 / Summarization-of-a-stack-of-papers-using-LLMs-

finding more datasets #2