Broken case-insensitive search for cyrillic text

zed-industries / zed

Code at the speed of thought – Zed is a high-performance, multiplayer code editor from the creators of Atom and Tree-sitter.

https://zed.dev


Broken case-insensitive search for cyrillic text #9980

Open detouched opened 7 months ago

detouched commented 7 months ago

Check for existing issues

[X] Completed

Describe the bug / provide steps to reproduce it

The text search seems to be always case-sensitive when looking for a string with Cyrillic characters.

Check out this sample:

ПРИВЕТ
привет
HELLO
hello

Now, if you search for hello, the last two rows both match, whereas if you search for привет, only the second row does. Toggling Match case on and off makes no difference.
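For illustration, the reported behavior matches what an ASCII-only case-insensitive comparison would produce. The sketch below is not Zed's code; `contains_ascii_ci` and `matching_lines` are hypothetical helpers built only on the Rust standard library, and they assume the search folds case for A-Z/a-z and compares every other byte exactly.

```rust
/// ASCII-only, case-insensitive substring test (hypothetical helper).
/// Bytes are compared with `eq_ignore_ascii_case`, so only A-Z / a-z are
/// folded; every non-ASCII byte has to match exactly.
fn contains_ascii_ci(haystack: &str, needle: &str) -> bool {
    let (h, n) = (haystack.as_bytes(), needle.as_bytes());
    if n.is_empty() {
        return true;
    }
    // Slide a window of the needle's byte length across the haystack.
    h.windows(n.len()).any(|w| w.eq_ignore_ascii_case(n))
}

/// Returns the 1-based numbers of the lines in `text` that contain `query`
/// under the ASCII-only comparison above (hypothetical helper).
fn matching_lines(text: &str, query: &str) -> Vec<usize> {
    text.lines()
        .enumerate()
        .filter(|(_, line)| contains_ascii_ci(line, query))
        .map(|(i, _)| i + 1)
        .collect()
}
```

Called on the four sample rows, `matching_lines(text, "hello")` yields rows 3 and 4, while `matching_lines(text, "привет")` yields only row 2, which is the same pattern as described above.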

Environment

Zed: v0.128.3 (Zed) OS: macOS 14.4.0 Memory: 32 GiB Architecture: x86_64

If applicable, add mockups / screenshots to help explain present your vision of the feature

With the Match case option disabled, search should find all matching substrings regardless of case.

If applicable, attach your `~/Library/Logs/Zed/Zed.log` file to this issue.

No response

VKondakoff commented 7 months ago

Can confirm this issue.

Zed: v0.129.1 (Zed Preview) OS: macOS 12.7.4 Memory: 32 GiB Architecture: x86_64

petros commented 7 months ago

Confirmed with Greek too:

[Screenshot: CleanShot 2024-03-30 at 20 58 32]

[Screenshot: CleanShot 2024-03-30 at 20 59 13]

petros commented 7 months ago

Case insensitivity works with Greek when you switch to Regex.

petros commented 7 months ago

It seems that Text search uses:

https://github.com/zed-industries/zed/blob/b1ccead0f66ada772ee2e0e5f76dfdd8b9473340/crates/project/src/search.rs#L60-L85

which relies on `ascii_case_insensitive`, whose documentation states:

NOTE: It is unlikely that support for Unicode case folding will be added in the future. The ASCII case works via a simple hack to the underlying automaton, but full Unicode handling requires a fair bit of sophistication. If you do need Unicode handling, you might consider using the regex crate or the lower level regex-automata crate.

I guess full Unicode handling is not as straightforward as it seems. Maybe a workaround for now is to switch to Regex search, which seems to handle those cases fine?
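To make the contrast concrete, here is a minimal sketch of a Unicode-aware comparison using only the Rust standard library. It is not what Zed or the regex crate does internally; `contains_unicode_ci` is a hypothetical helper, and lowercasing both sides with `str::to_lowercase` only approximates real Unicode case folding.

```rust
/// Unicode-aware, case-insensitive substring test (hypothetical helper).
/// Both sides are lowercased with `str::to_lowercase`, which handles
/// Cyrillic, Greek, and umlauts, but is only an approximation of full
/// Unicode case folding (for example, "ß" and "SS" will not match).
fn contains_unicode_ci(haystack: &str, needle: &str) -> bool {
    // Allocates two temporary strings per call; fine for a sketch,
    // too slow for searching large buffers.
    haystack.to_lowercase().contains(&needle.to_lowercase())
}
```

With this helper, `contains_unicode_ci("ПРИВЕТ", "привет")` is true. Note that byte offsets found in the lowercased copy do not necessarily map back to the original text, so a real search would need more care than this.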

github-actions[bot] commented 1 month ago

Hi there! 👋 We're working to clean up our issue tracker by closing older issues that might not be relevant anymore. Are you able to reproduce this issue in the latest version of Zed? If so, please let us know by commenting on this issue and we will keep it open; otherwise, we'll close it in 10 days. Feel free to open a new issue if you're seeing this message after the issue has been closed. Thanks for your help!

dragnev-dev commented 1 month ago

> Are you able to reproduce this issue in the latest version of Zed?

I confirm the issue is reproducible in the latest version of Zed.

Zed: v0.155.2 OS: Linux Fedora 39 Memory: 48 GiB Architecture: x86_64

rauberdaniel commented 1 month ago

The same issue happens with umlauts (ÄäÖöÜü) in German texts, so it probably affects any non-ASCII characters.
