Open brannerchinese opened 6 years ago
@sampritipanda obviously this is not the highest priority issue since it's fairly obscure, but if you're interested in investigating this, I'd be curious to understand what's happening here.
Hello @rsarky!
You have attempted to claim an issue without the label "help wanted". It seems like you are new to Zulip, so we suggest working on an issue with the help wanted or good first issue label first. We also recommend reading our guide for new contributors before getting started.
To claim this issue anyway, comment on this issue again with the command @zulipbot claim --force.
Hello @zulip/server-search members, this issue was labeled with the area: search label, so you may want to check it out!
The issue here is with tsearch (PostgreSQL's built-in text search), which doesn't let you search for part of a word. It treats each pair as a word and the combining characters as part of that word, which is why you can search for the pair but not for the specific character. If you send the message m̥o̦n̑ without spaces, then searching for a combined pair (m̥) will not return anything.
This issue is fixed if you use the PGroonga search backend, which is much more flexible. I believe zulipchat uses tsearch, though.
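The behaviour described above can be sketched with a toy model in Python. This is not how tsearch or PGroonga is implemented: `whole_token_search` and `substring_search` are hypothetical helpers, whitespace splitting stands in for the real parser, and the code points for the three pairs are my reading of the example string.

```python
# Toy model only: whitespace splitting stands in for the real tsearch parser.

def whole_token_search(message: str, query: str) -> bool:
    """Match only when the query equals an entire whitespace-delimited token."""
    return query in message.split()


def substring_search(message: str, query: str) -> bool:
    """Match when the query occurs anywhere in the message."""
    return query in message


# Assumed code points: m + ring below, o + comma below, n + inverted breve.
spaced = "m\u0325 o\u0326 n\u0311"
joined = "m\u0325o\u0326n\u0311"
pair = "m\u0325"   # base letter plus its combining mark
mark = "\u0325"    # the combining mark on its own

print(whole_token_search(spaced, pair))  # True: the pair is a whole token
print(whole_token_search(joined, pair))  # False: the pair is only part of a token
print(whole_token_search(spaced, mark))  # False: the mark is only part of a token
print(substring_search(joined, mark))    # True: substring matching sees the mark
```

In this model the combining mark never forms a token of its own, so a whole-token search can never find it, whether or not the message contains spaces.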
Zulip's search functionality doesn't seem to be able to find single characters from the several Unicode Phonetic Extensions and Combining Diacritical Marks blocks, if they occur in connected transcription in a message. An arcane request? (User bowed deeply.) Perhaps, but this is a real-life use case that would expand Zulip's utility for people working in philology and linguistics. There are two interesting varieties of the problem:
1. Ordinary IPA symbols.
I sent myself a message containing two imaginary strings of symbols in IPA: θɤŋ ʓᶕ. If I search for them as contiguous substrings (θɤŋ or ʓᶕ), Zulip finds them. But if I search for any of the five characters in isolation, Zulip does not find this message.
Interestingly, it does find a different message ending in θ if I search for that character alone. The message it finds is rendered as "do you speak μαθ?" and, looking at the source of the message, I can see that the three IPA characters are composed as math, not IPA, like this:
2. Combining characters that almost always occur overstruck with other characters, so that except for IPA and Unicode specialists most users don't conceive of them as discrete glyphs. But discrete glyphs they are, and sometimes they need to be searched for as such.
I sent myself messages containing three of them chosen at random, first in combination with other letters and then in isolation. If I search for the combinations, the search functionality finds them; if I search for them in isolation, it does not. I'd like the latter search to succeed.
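For anyone reproducing this, a minimal Python sketch for inspecting which code points a string contains, and for building test strings from code points, might look like the following. The helper names `describe` and `from_code_points` are hypothetical, and the sample uses the m̥ pair quoted elsewhere in this thread (assumed to be U+006D followed by U+0325), not the three characters listed in this report.

```python
import unicodedata


def describe(text: str) -> list[tuple[str, str, bool]]:
    """Return (code point, Unicode name, is_combining) for each character."""
    rows = []
    for ch in text:
        code_point = f"U+{ord(ch):04X}"
        name = unicodedata.name(ch, "<unnamed>")
        # A non-zero canonical combining class marks a combining character.
        is_combining = unicodedata.combining(ch) != 0
        rows.append((code_point, name, is_combining))
    return rows


def from_code_points(*code_points: int) -> str:
    """Build a string from integer code points, e.g. to generate test messages."""
    return "".join(chr(cp) for cp in code_points)


sample = from_code_points(0x006D, 0x0325)  # assumed make-up of the m̥ pair
for row in describe(sample):
    print(row)
```

Running `describe` over a message copied from Zulip shows whether a glyph that looks like a single character is really a base letter followed by one or more combining marks.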
As per @timabbott's late comment in #8474, here are the three characters, including their glyphs for easy copying and their code points for easy generation: