zaidalyafeai / Arabert

Arabic Language Model based on BERT

Corpora #1

Open zaidalyafeai opened 4 years ago

zaidalyafeai commented 4 years ago

Here we combine all the datasets we can collect.

abedkhooli commented 4 years ago

This is what I know of so far (copied as is from a notebook):

### datasets: https://dl.acm.org/doi/abs/10.1145/2911451.2914677 (https://sites.google.com/view/arabicweb16)  
#           http://opus.nlpl.eu/
#           https://www.kaggle.com/linuxscout/tashkeela  
#           https://traces1.inria.fr/oscar/  (shuffled by line)
#           https://github.com/zaidalyafeai/ARBML#datasets  
#           http://opus.nlpl.eu/OpenSubtitles-v2016.php    
#           https://archive.alsharekh.org/    
#           https://github.com/soskek/bookcorpus  (https://www.smashwords.com) [no Arabic]
#           http://www.alwaraq.net/Core/index.jsp?option=1
#           http://dlib.nyu.edu/aco/ (scans)
#           https://www.blindarab.net/index.php?action=view_subcat&catid=4&id=24&page=5   
#           https://www.al-mostafa.com/disp.php?page=list&n=0 (mixed formats)
#           https://github.com/abdelrahmaan/Hadith-Data-Sets
#           https://www.hindawi.org/books/    (EPUBs; some are not MSA)
#           https://archive.org/details/texts?and%5B%5D=languageSorter%3A%22Arabic%22&sort=-downloads&page=4
#           http://www.al-eman.com/index.htm  
#           https://catalog.ldc.upenn.edu/topten
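
Once a few of these sources have been downloaded and converted to plain UTF-8 text with one sentence or passage per line, combining them could look roughly like the sketch below. This is only an illustration, not something taken from the notebook: the function names `iter_unique_lines` and `merge_corpora` are made up here, and the one-text-per-line layout is an assumption that will not hold for every source in the list (scans and EPUBs need extraction first). Exact duplicate lines are dropped by hashing each line.

```python
import hashlib
from pathlib import Path
from typing import Iterable, Iterator, Union

PathLike = Union[str, Path]


def iter_unique_lines(paths: Iterable[PathLike]) -> Iterator[str]:
    """Yield stripped, non-empty lines from the files, skipping exact duplicates."""
    seen = set()  # 16-byte digests of the lines already emitted
    for path in paths:
        # errors="replace" keeps going if a file contains invalid UTF-8 bytes
        with open(path, encoding="utf-8", errors="replace") as handle:
            for line in handle:
                text = line.strip()
                if not text:
                    continue
                digest = hashlib.blake2b(text.encode("utf-8"), digest_size=16).digest()
                if digest in seen:
                    continue
                seen.add(digest)
                yield text


def merge_corpora(paths: Iterable[PathLike], output_path: PathLike) -> int:
    """Write the de-duplicated lines to output_path and return how many were written."""
    count = 0
    with open(output_path, "w", encoding="utf-8") as out:
        for text in iter_unique_lines(paths):
            out.write(text + "\n")
            count += 1
    return count
```

Calling `merge_corpora(["oscar_ar.txt", "tashkeela.txt"], "combined.txt")` (hypothetical file names) would write one merged file and return the number of lines kept. Storing digests instead of full lines keeps memory lower, but for very large corpora the `seen` set can still grow big, so an on-disk or sort-based deduplication may be needed.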