Open zaidalyafeai opened 4 years ago
This is what I know of so far (copied as is from a notebook):
### datasets: https://dl.acm.org/doi/abs/10.1145/2911451.2914677 (https://sites.google.com/view/arabicweb16)
# http://opus.nlpl.eu/
# https://www.kaggle.com/linuxscout/tashkeela
# https://traces1.inria.fr/oscar/ shuffled by line
# https://github.com/zaidalyafeai/ARBML#datasets
# http://opus.nlpl.eu/OpenSubtitles-v2016.php
# https://archive.alsharekh.org/
# https://github.com/soskek/bookcorpus (https://www.smashwords.com) [no Arabic]
# http://www.alwaraq.net/Core/index.jsp?option=1 http://dlib.nyu.edu/aco/ (scans)
# https://www.blindarab.net/index.php?action=view_subcat&catid=4&id=24&page=5
# https://www.al-mostafa.com/disp.php?page=list&n=0 (mixed formats)
# https://github.com/abdelrahmaan/Hadith-Data-Sets
# https://www.hindawi.org/books/ epubs, some are not MSA
# https://archive.org/details/texts?and%5B%5D=languageSorter%3A%22Arabic%22&sort=-downloads&page=4
# http://www.al-eman.com/index.htm
# https://catalog.ldc.upenn.edu/topten
Here we combine all the datasets we can collect.
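For what it's worth, here is a rough sketch of what "combining" could look like once some of the sources above are downloaded as plain text. It is only an illustration under assumptions that are not in the list itself: that each source has already been converted to a UTF-8 `.txt` file with one sentence or paragraph per line, and that dropping exact duplicate lines is wanted. The names `merge_lines`, `read_lines`, and `combine_corpora` are made up for this example.

```python
from pathlib import Path
from typing import Iterable, Iterator


def merge_lines(sources: Iterable[Iterable[str]]) -> Iterator[str]:
    """Yield stripped, non-empty lines from every source, skipping exact repeats."""
    seen = set()  # holds every distinct line in memory; fine for a sketch, not for web-scale dumps
    for source in sources:
        for line in source:
            text = line.strip()
            if text and text not in seen:
                seen.add(text)
                yield text


def read_lines(path: Path) -> Iterator[str]:
    """Stream one file line by line instead of loading it whole."""
    with path.open(encoding="utf-8") as handle:
        yield from handle


def combine_corpora(input_dir: str, output_file: str) -> int:
    """Merge every *.txt file in input_dir (hypothetical layout) into output_file.

    Returns the number of lines written. Keep output_file outside input_dir,
    otherwise a previous output would be picked up as an input on the next run.
    """
    paths = sorted(Path(input_dir).glob("*.txt"))
    count = 0
    with open(output_file, "w", encoding="utf-8") as out:
        for text in merge_lines(read_lines(p) for p in paths):
            out.write(text + "\n")
            count += 1
    return count
```

Something like `combine_corpora("raw_text", "combined.txt")` would then produce a single file. Sources that are scans, epubs, or mixed formats would need their own extraction step first, which this sketch does not cover.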