zulip / zulip

Zulip server and web application. Open-source team chat that helps teams stay productive and focused.
https://zulip.com
Apache License 2.0

Enable searching for Unicode phonetic and diacritic characters in isolation #8492

Open brannerchinese opened 6 years ago

brannerchinese commented 6 years ago

Zulip's search functionality doesn't seem to be able to find single characters from the several Unicode Phonetic Extensions and Combining Diacritical Marks blocks when they occur in connected transcription in a message. An arcane request? (User bowed deeply.) Perhaps, but this is a real-life use case that would expand Zulip's utility for people working in philology and linguistics. There are two interesting varieties of the problem:

  1. Ordinary IPA symbols.

    I sent myself a message containing two imaginary strings of symbols in IPA: θɤŋ ʓᶕ. If I search for them as contiguous substrings (θɤŋ or ʓᶕ) Zulip finds them. But if I search for any of the five characters in isolation, Zulip does not find this message.

    Interestingly, it does find a different message ending in θ if I search for that character alone. The message it finds is rendered as "do you speak μαθ?", and looking at the source of the message I can see that its last three characters are composed as math, not IPA, like this:

    ```math
    do\,you\,speak\,\mu\alpha\theta?
    ```

  2. Combining characters that almost always occur overstruck with other characters, so that except for IPA and Unicode specialists most users don't conceive of them as discrete glyphs. But discrete glyphs they are, and sometimes they need to be searched for as such.

    I sent myself messages containing three of them chosen at random, first in combination with other letters and then in isolation. If I search for the combinations, the search functionality finds them; if I search for them in isolation, it does not. I'd like the latter search to succeed.

    As per @timabbott's late comment in #8474, here are the three characters, including their glyphs for easy copying and their code points for easy generation:

    ```
    In [1]: print('m̥ o̦ n̑') # in combination with other letters
    m̥ o̦ n̑

    In [2]: print('  {}  {}  {}'.format('\u0325', '\u0326', '\u0311')) # in isolation
     ̥  ̦  ̑
    ```

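
For anyone reproducing this, here is a small sketch that lists the three marks by code point and checks that each "combined" letter is really two code points. It uses only Python's standard `unicodedata` module; the `describe` helper is a name made up for this illustration and is not part of Zulip.

```python
import unicodedata

# The three combining marks from the session above.
MARKS = ["\u0325", "\u0326", "\u0311"]


def describe(ch):
    """Return (code point, Unicode name, general category) for one character.

    Hypothetical helper, for illustration only.
    """
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))


for mark in MARKS:
    print(describe(mark))
# ('U+0325', 'COMBINING RING BELOW', 'Mn')
# ('U+0326', 'COMBINING COMMA BELOW', 'Mn')
# ('U+0311', 'COMBINING INVERTED BREVE', 'Mn')

# Each overstruck letter is a base letter followed by a separate mark.
print(len("m\u0325"))  # 2
```

The `Mn` category (nonspacing mark) is what makes these characters render on top of the preceding letter even though they are stored as separate code points.
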
timabbott commented 6 years ago

@sampritipanda obviously this is not the highest priority issue since it's fairly obscure, but if you're interested in investigating this, I'd be curious to understand what's happening here.

zulipbot commented 6 years ago

Hello @rsarky!

You have attempted to claim an issue without the label "help wanted". It seems like you are new to Zulip, so we suggest working on an issue with the help wanted or good first issue label first. We also recommend reading our guide for new contributors before getting started.

To claim this issue anyway, comment on this issue again with the command @zulipbot claim --force.

zulipbot commented 6 years ago

Hello @zulip/server-search members, this issue was labeled with the area: search label, so you may want to check it out!

sampritipanda commented 6 years ago

The issue here is with tsearch (PostgreSQL's built-in full-text search), which doesn't let you search for part of a word. It treats each letter-plus-mark pair as a word and the combining character as part of that word, which is why you can search for the pair but not for the combining character on its own. If you send the message m̥o̦n̑ without spaces, then searching for a single combined pair will not return anything.
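
To make the behaviour described above concrete, here is a rough sketch in plain Python. It is not how tsearch actually parses text: `token_search` is a hypothetical stand-in that splits on whitespace and matches whole tokens only, and `substring_search` is a hypothetical stand-in for a more flexible backend. The point is only to show why whole-word matching finds the pair but not the isolated mark.

```python
def token_search(text, query):
    """Whole-token match: a crude stand-in for word-based full-text search.

    Assumption for illustration: words are separated by whitespace only.
    """
    return query in text.split()


def substring_search(text, query):
    """Plain substring match: a stand-in for a more flexible search backend."""
    return query in text


spaced = "m\u0325 o\u0326 n\u0311"  # three letter+mark pairs, separated by spaces
joined = "m\u0325o\u0326n\u0311"    # the same pairs written as one word

print(token_search(spaced, "m\u0325"))    # True: the pair is a whole token
print(token_search(spaced, "\u0325"))     # False: the mark is only part of a token
print(token_search(joined, "m\u0325"))    # False: the pair is now part of a longer word
print(substring_search(spaced, "\u0325")) # True: substring matching finds the mark
```
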

This problem goes away if you use the PGroonga search backend, which is much more flexible. I believe zulipchat uses tsearch, though.