zot / microfts

Small and fast FTS (full text search)
MIT License

Non Latin scripts #13

Open oatmealm opened 3 years ago

oatmealm commented 3 years ago

Hi. I wanted to ask whether this should work with non-Latin scripts. I've installed it and tested quickly, and it doesn't seem to find anything when searching for text in Hebrew or Arabic. I only spent a short time testing, so sorry if I'm missing something here.

zot commented 3 years ago

Oh boy. I'll probably need help to do that. Do you program in Go by any chance?

I chose to index on 0-9, a-z, and "anything else" in order to fit a trigram into 16 bits. That means it essentially just supports ASCII right now.
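To make the 16-bit constraint concrete, here is a minimal sketch of packing a trigram over 37 symbol classes (10 digits, 26 letters, and one "anything else" class) into a `uint16`. The class numbering and the names `class` and `trigramCode` are assumptions for illustration, not necessarily how microfts lays out its index. Since 37^3 = 50,653 is below 2^16 = 65,536, every trigram fits. The sketch also shows why Hebrew text finds nothing: every UTF-8 byte of a Hebrew letter lands in the "anything else" class, so all such trigrams collapse to the same code.

```go
package main

import "fmt"

// base is the number of symbol classes: 10 digits + 26 letters + 1 "anything else".
const base = 37

// class maps a byte to a symbol class. The numbering is hypothetical:
// 0 = anything else, 1-10 = '0'-'9', 11-36 = 'a'-'z'.
// A real indexer would probably lowercase its input first; this sketch does not.
func class(b byte) uint16 {
	switch {
	case b >= '0' && b <= '9':
		return uint16(b-'0') + 1
	case b >= 'a' && b <= 'z':
		return uint16(b-'a') + 11
	default:
		return 0
	}
}

// trigramCode packs three bytes into one base-37 number.
// The largest possible value is 37^3 - 1 = 50652, so it fits in a uint16.
func trigramCode(a, b, c byte) uint16 {
	return (class(a)*base+class(b))*base + class(c)
}

func main() {
	fmt.Println(trigramCode('a', 'b', 'c'))

	// "shalom" in Hebrew, written with escapes; every UTF-8 byte is >= 0x80.
	s := "\u05e9\u05dc\u05d5\u05dd"
	for i := 0; i+2 < len(s); i++ {
		// Every trigram of these bytes maps to code 0.
		fmt.Println(trigramCode(s[i], s[i+1], s[i+2]))
	}
}
```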

Supporting other character sets would need some redesign. There are a few different approaches to consider for that. For full Unicode, I'd need to tell it how to ignore punctuation in each character set.

Specifying the character set on database creation and making org-fts pick the DB based on character set might be the best way to start out, adding support for one character set at a time.

Want to help out with this? We could make Hebrew and Arabic the first character sets supported besides English....

oatmealm commented 3 years ago

Thanks for the reply. No, unfortunately I don't.

From what I've seen so far it looks very promising, though, both in performance and in its Emacs integration. I mostly use org for study/research notes.

I'm checking whether it's specific to Doom Emacs or to my configuration, but pressing enter does not open the file from ivy's candidate list. I can see the file and the context snippet, but I can't open the file.

zot commented 3 years ago

I'll look into it -- the project is very new so I'm sure there are bugs I haven't found. It doesn't seem to be behaving for me, either, at the moment...

zot commented 3 years ago

Thinking about this, adding multilingual support is possible for alphabets with 29 or fewer characters. The document ("group") record could have a translation table for I/O. 29 characters plus 10 digits plus "whitespace" makes 40, and 40^3 is 64,000, which just fits into 16 bits (2^16 is 65,536)...
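A rough sketch of what such a translation table could look like, assuming one table per alphabet that maps runes to classes 0-39 (0 for whitespace and anything unmapped, 1-10 for digits, 11 and up for letters). The names `table`, `newTable`, `code`, and `hebrewAlphabet` are hypothetical and not part of microfts. The Hebrew letters U+05D0 through U+05EA, final forms included, are 27 code points, so they fit under the 29-letter limit.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	base       = 40 // 29 letters + 10 digits + 1 whitespace/other class
	maxLetters = 29
)

// table maps a rune to its symbol class. Runes that are not in the map
// read back as 0, which this sketch treats as whitespace/other.
type table map[rune]uint16

// newTable builds a translation table for one alphabet (hypothetical helper).
func newTable(alphabet []rune) (table, error) {
	if len(alphabet) > maxLetters {
		return nil, errors.New("alphabet has more than 29 letters")
	}
	t := make(table, len(alphabet)+10)
	for d := '0'; d <= '9'; d++ {
		t[d] = uint16(d-'0') + 1 // digits take classes 1-10
	}
	for i, r := range alphabet {
		t[r] = uint16(i) + 11 // letters take classes 11-39
	}
	return t, nil
}

// code packs three runes into one base-40 number.
// The largest possible value is 40^3 - 1 = 63999, so it fits in a uint16.
func (t table) code(a, b, c rune) uint16 {
	return (t[a]*base+t[b])*base + t[c]
}

// hebrewAlphabet returns the 27 code points U+05D0 through U+05EA.
func hebrewAlphabet() []rune {
	var letters []rune
	for r := rune(0x05D0); r <= 0x05EA; r++ {
		letters = append(letters, r)
	}
	return letters
}

func main() {
	t, err := newTable(hebrewAlphabet())
	if err != nil {
		panic(err)
	}
	// First trigram of "shalom": shin, lamed, vav.
	fmt.Println(t.code('\u05E9', '\u05DC', '\u05D5'))
}
```

Storing the table in the group record, as suggested above, would let each document group choose its own alphabet while the on-disk trigram width stays at 16 bits.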