zeeshansayyed / ArabicSOS

Segmenter and Orthography Standardizer (SOS) for Classical Arabic (CA)
GNU Lesser General Public License v3.0

segmentation to texts or words? #1

Open liutianling opened 4 years ago

liutianling commented 4 years ago

Thanks for sharing. I'd like to know whether Arabic needs segmentation the way Chinese does. That is, when doing NLP tasks with Arabic, do I need to split the text into words? Thanks!

zeeshansayyed commented 4 years ago

Yes and no. Arabic needs segmentation, but not in the way Chinese does. In Arabic, most words are separated by spaces, but many words are glued together and written as a single larger word. So some word boundaries are marked by spaces and others are not.
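To make the "glued together" point concrete, here is a minimal sketch in Python. It is a toy, not how ArabicSOS works: the `SLOTS` table and the `toy_segment` function are hypothetical names made up for this illustration, and a fixed prefix list like this will mis-split many real words (for example, words whose stem happens to start with one of these letters). Real segmenters rely on learned models.

```python
# Toy illustration only. SLOTS and toy_segment are hypothetical names,
# not part of ArabicSOS or any other library.
# Each slot may contribute at most one attached prefix, in this order:
# conjunction, then preposition, then the definite article.
SLOTS = [
    ("و", "ف"),        # conjunctions: "and", "so"
    ("ب", "ل", "ك"),   # prepositions: "with/by", "for", "like"
    ("ال",),           # definite article: "the"
]

def toy_segment(word, min_stem=3):
    """Peel known prefixes off the front of a single space-delimited word."""
    parts = []
    for options in SLOTS:
        for prefix in options:
            # Only strip if enough characters remain to plausibly be a stem.
            if word.startswith(prefix) and len(word) - len(prefix) >= min_stem:
                parts.append(prefix)
                word = word[len(prefix):]
                break
    parts.append(word)
    return parts

# "وبالكتاب" is one space-delimited token meaning "and with the book".
sentence = "وبالكتاب قلم"
whitespace_tokens = sentence.split()
segmented = [piece for w in whitespace_tokens for piece in toy_segment(w)]

print(len(whitespace_tokens))  # 2 tokens when splitting on spaces only
print(len(segmented))          # 5 tokens after peeling the attached prefixes
```

Splitting on spaces alone treats the first token as one unit, while the segmented view exposes the conjunction, preposition, article, and noun separately.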

liutianling commented 4 years ago

@zeeshansayyed Thanks! I want to get embeddings for Arabic text. Do you have any suggestions on whether the corpus should be split just on spaces, or whether it needs other processing? Thanks.

zeeshansayyed commented 4 years ago

There's no single answer to this. Most off-the-shelf Arabic embeddings simply use the corpus as is, i.e. with the natural spaces already present in the text. People then run a segmenter as part of the NLP pipeline before any further processing. But it would be interesting to have embeddings trained on a segmented corpus.
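The two options can be sketched as two ways of preparing the same corpus. This is only an outline under stated assumptions: `prepare_corpus`, `demo_segment`, and the `DEMO_TABLE` lookup are hypothetical names invented here, and the lookup table stands in for a real segmenter, which you would plug in through the `segment` argument.

```python
# Hypothetical helper names; the lookup table is a stand-in for a real segmenter.
DEMO_TABLE = {
    "وبالكتاب": ["و", "ب", "ال", "كتاب"],  # "and" + "with" + "the" + "book"
}

def demo_segment(word):
    """Stand-in segmenter: return known segments, else the word unchanged."""
    return DEMO_TABLE.get(word, [word])

def prepare_corpus(lines, segment=None):
    """Turn raw text lines into lists of tokens.

    With segment=None, tokens are the natural space-delimited words.
    Otherwise each word is passed through `segment` (word -> list of pieces).
    """
    corpus = []
    for line in lines:
        words = line.split()
        if segment is None:
            corpus.append(words)
        else:
            corpus.append([piece for w in words for piece in segment(w)])
    return corpus

lines = ["وبالكتاب قلم"]
raw_corpus = prepare_corpus(lines)                     # corpus as is
seg_corpus = prepare_corpus(lines, segment=demo_segment)  # segmented corpus

print([len(sent) for sent in raw_corpus])  # [2]
print([len(sent) for sent in seg_corpus])  # [5]
```

Either result is a list of token lists, which is the input shape that common embedding trainers expect (gensim's `Word2Vec`, for instance, takes such a list as its `sentences` argument), so the same training code can be run on both variants and compared.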

liutianling commented 4 years ago

Thank you very much! I will try some methods. Thanks.