ycm-core / ycmd

A code-completion & code-comprehension server
https://ycm-core.github.io/ycmd/
GNU General Public License v3.0

Support Unicode 15.1 new GB9c break rule #1718

Closed DonKult closed 9 months ago

DonKult commented 10 months ago

ycmd embeds its Unicode support files and tests (currently for version 13), and provides a script (update_unicode.py) to update them to the latest Unicode version. This worked for the upgrade to version 14, but no longer works with 15. For example, the tests fail with:

[ RUN      ] UnicodeTest/WordTest.BreakIntoCharacters/1186
./cpp/ycm/tests/Word_test.cpp:60: Failure
Value of: Word( word_.text_ ).Characters()
Expected: { *{ "\xE0\xA4\x95\xE0\xA5\x8D\xE0\xA5\x8D\xE0\xA4\xA4"
    As Text: "कत", "\xE0\xA4\x95\xE0\xA4\xA4"
    As Text: "कत", "\xE0\xA4\x95\xE0\xA5\x8D\xE0\xA5\x8D\xE0\xA4\xA4"
    As Text: "कत", "\xE0\xA4\x95\xE0\xA5\x8D\xE0\xA5\x8D\xE0\xA4\xA4"
    As Text: "कत", false, true, false, false } }
  Actual: { *{ "\xE0\xA4\x95\xE0\xA5\x8D\xE0\xA5\x8D"
    As Text: "क", "\xE0\xA4\x95"
    As Text: "क", "\xE0\xA4\x95\xE0\xA5\x8D\xE0\xA5\x8D"
    As Text: "क", "\xE0\xA4\x95\xE0\xA5\x8D\xE0\xA5\x8D"
    As Text: "क्, false, true, false, false }, *{ "\xE0\xA4\xA4"
    As Text: "त", "\xE0\xA4\xA4"
    As Text: "त", "\xE0\xA4\xA4"
    As Text: "त", "\xE0\xA4\xA4"
    As Text: "त", true, true, false, false } }

[  FAILED  ] UnicodeTest/WordTest.BreakIntoCharacters/1186, where GetParam() = { "\xE0\xA4\x95\xE0\xA5\x8D\xE0\xA5\x8D\xE0\xA4\xA4"
    As Text: "कत", { "\xE0\xA4\x95\xE0\xA5\x8D\xE0\xA5\x8D\xE0\xA4\xA4"
    As Text: "कत" } (0 ms)
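To make the failing parameter easier to read, here is a minimal Python sketch that decodes the bytes from the GetParam() line into code points and their Unicode names. The helper name `describe` is made up for illustration and is not part of ycmd.

```python
import unicodedata

# Bytes copied from the GetParam() line of the failing test above.
RAW = b"\xE0\xA4\x95\xE0\xA5\x8D\xE0\xA5\x8D\xE0\xA4\xA4"


def describe(data: bytes) -> list[tuple[str, str]]:
    """Return (code point, Unicode name) pairs for UTF-8 encoded bytes.

    Hypothetical helper for illustration; not a ycmd function.
    """
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch))
        for ch in data.decode("utf-8")
    ]


if __name__ == "__main__":
    for code_point, name in describe(RAW):
        print(code_point, name)
```

The input is a consonant, two viramas, and another consonant. The test expects all four code points in one character, while the actual result splits before the final consonant.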

The reason is that Unicode 15.1 introduces a new rule for (not) breaking, GB9c, and the new tests exercising this rule now fail.
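As a rough illustration of GB9c, the following Python sketch checks the rule "Consonant, then any mix of Extend and Linker containing at least one Linker, then no break before the next Consonant". It is not ycmd's implementation: the `INCB` table is a hand-picked subset of Indic_Conjunct_Break values, and `toy_clusters` is a made-up segmenter that only knows a GB9-like "attach Extend/Linker" rule plus GB9c.

```python
# Hand-picked subset of Indic_Conjunct_Break values (assumption: illustrative
# only; a real implementation would load the full Unicode data).
INCB = {
    0x0915: "Consonant",  # DEVANAGARI LETTER KA
    0x0924: "Consonant",  # DEVANAGARI LETTER TA
    0x094D: "Linker",     # DEVANAGARI SIGN VIRAMA
    0x200D: "Extend",     # ZERO WIDTH JOINER
}


def gb9c_forbids_break(cps, i):
    """Return True if GB9c forbids a break immediately before cps[i]."""
    if i <= 0 or i >= len(cps) or INCB.get(cps[i]) != "Consonant":
        return False
    seen_linker = False
    j = i - 1
    # Walk backwards over Extend/Linker until the previous Consonant.
    while j >= 0:
        prop = INCB.get(cps[j])
        if prop == "Linker":
            seen_linker = True
        elif prop == "Consonant":
            return seen_linker  # at least one Linker must sit in between
        elif prop != "Extend":
            return False
        j -= 1
    return False


def toy_clusters(text, use_gb9c):
    """Toy segmenter (hypothetical): GB9-like rule, optionally GB9c."""
    cps = [ord(c) for c in text]
    clusters = []
    for i, cp in enumerate(cps):
        attach = i > 0 and (
            INCB.get(cp) in ("Extend", "Linker")
            or (use_gb9c and gb9c_forbids_break(cps, i))
        )
        if attach:
            clusters[-1] += chr(cp)
        else:
            clusters.append(chr(cp))
    return clusters


if __name__ == "__main__":
    word = "\u0915\u094d\u094d\u0924"  # the failing test input
    print(len(toy_clusters(word, use_gb9c=False)))
    print(len(toy_clusters(word, use_gb9c=True)))
```

Without GB9c the toy segmenter splits before the final consonant, as in the "Actual" output of the failing test; with GB9c the whole sequence stays together, as in "Expected".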

Prior art implementing this rule elsewhere: https://github.com/JuliaStrings/utf8proc/pull/253

Would be nice if support for newer Unicode standards could be added to ycmd.

puremourning commented 10 months ago

PR welcome.

bstaletic commented 10 months ago

I recently tried a naive upgrade to the latest Unicode standard, but saw that the tests were failing.

"The reason is that 15.1 introduces a new rule for (not) breaking: GB9c and of course the new tests exercising this rule fail now."

Thanks for tracking this down. I have stopped at the previous step, because I was busy. In case you do want to contribute:

https://github.com/ycm-core/ycmd/blob/master/cpp/ycm/Word.cpp#L31

bstaletic commented 10 months ago

@DonKult I am afraid I am having trouble understanding the new rule. Until now, the boundary rules table only used break properties that are explained in the values table. The newly added GB9c refers to "Indic_Conjunct_Break" (InCB), but I don't see it defined anywhere.

On top of that, InCB is not mentioned in the break property data either.

 

EDIT: Found it!

Now, this is a new property. That means we will need to extend our UnicodeData.inc with one more data member. :/ It's definitely doable, but it does make me wish we had a sparse vector to save on space.
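One way to picture the extra property and the sparse-storage idea is the Python sketch below. It parses a few lines written in the style of the Unicode data file that lists InCB values, stores them as sorted ranges, and looks code points up with a binary search. The sample lines are typed in by hand for illustration, the function names are hypothetical, and none of this reflects the actual layout of UnicodeData.inc.

```python
import bisect

# Sample lines in the style of the Unicode InCB data (assumption: a hand-typed
# excerpt for illustration, not the full or verbatim file).
SAMPLE = """\
094D          ; InCB; Linker # Mn       DEVANAGARI SIGN VIRAMA
0915..0939    ; InCB; Consonant # Lo  [37] DEVANAGARI LETTER KA..DEVANAGARI LETTER HA
200D          ; InCB; Extend # Cf       ZERO WIDTH JOINER
"""


def parse_incb(text):
    """Return sorted (first, last, value) ranges for InCB lines."""
    ranges = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comment
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        if len(fields) != 3 or fields[1] != "InCB":
            continue
        first, _, last = fields[0].partition("..")
        ranges.append((int(first, 16), int(last or first, 16), fields[2]))
    return sorted(ranges)


def lookup(ranges, code_point, default="None"):
    """Binary-search the sparse range list; unlisted code points get default."""
    starts = [r[0] for r in ranges]
    idx = bisect.bisect_right(starts, code_point) - 1
    if idx >= 0 and ranges[idx][0] <= code_point <= ranges[idx][1]:
        return ranges[idx][2]
    return default


if __name__ == "__main__":
    table = parse_incb(SAMPLE)
    print(lookup(table, 0x0915), lookup(table, 0x094D), lookup(table, 0x0041))
```

Since most code points have no InCB value, a range list like this stays small, at the cost of a logarithmic lookup instead of a direct index.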

EDIT2: I understand the point of having links to the versioned documents in update_unicode.py, but half of them already aren't versioned, and it is now just confusing. We should pick a side.

I know, I know... I am to blame for that mess.

bstaletic commented 9 months ago

We have merged the update.