Diacritical marks are ignored when differientiating

hadithmv commented 2 years ago

While mergely does not by default ignore things like whitespace, case and accents, it does however ignore diacritical marks.

I've included a picture as well as a diff file that shows this for a couple of languages:

RL2vASt1.zip

hadithmv commented 2 years ago

It seems I can only download a single diff file from the site, yet import requires two separate files for two different sides. So instead here's a link: https://editor.mergely.com/UsHJyREC/

hadithmv commented 2 years ago

Or because merge left/right arrows are there, it does know that there are differences, just doesn't visually show them with colors as expected.

While the diacritic mark is small by itself, in actual use it always goes on top of or under another letter. So ideally, both the diacritic and the letter it is on should be highlighted during differentiation. Otherwise, the word.

wickedest commented 2 years ago

@hadithmv, thanks for reporting the issue. there are actually 2 separate diffs being run. The 1st is by line, and only detects lines that are different. The 2nd is by character, within the line. It appears the 2nd diff may be failing diacritical marks for some reason.

hadithmv commented 5 months ago

Its been a while, and I don't know if this would be of help, but,

This is how vscode's shows it on their diff compare, it highlights the exact place where the change occurs. I thought that since it was open source, it might be possible to check out how they are getting this to work.

hadithmv commented 5 months ago

here's how another diff compare tool fixed this particular issue:

https://gitlab.gnome.org/GNOME/meld/-/commit/d272016a481a9f9563d318a79ce9767cb8ad69ef

might be of help.

wickedest commented 5 months ago

Sorry - this has been on the bottom of the pile for a while.

This is a tough one. Normally, diacriticals can be combined into a single character, e.g. a + ´ = á. This can be done with String.normalize("NFC"), but normalize does not work in this case.

For example, if we look at كلمة and كَلمة:

I am not a unicode expert, so I'm a bit lost because while it behaves like a diacritical, as far as I can tell, \u064E is something else. I understand diacriticals to be in these ranges (source):

0300-036F
1AB0-1AFF
1DC0-1DFF
20D0-20FF
2DE2-2DFF
FE20-FE2F

\u064E is not in those ranges and is identified as an "Arabic Fatha". Unlike \u0300 is a "Diacritical".

My guess is that the Fatha cannot be combined into a single character using normalize and whenever Mergely encounters these Fatha's, then it should include the adjacent character in the diff (i.e. ['0643', '064E']). But that isn't going to be easy. For starters, outside of the ranges above, I have no idea how many unicode characters behave this way.

wickedest commented 5 months ago

The problem seems to be specifically with nonspacing marks. The diacritical might only be a subset. Only a small number of "common" marks can be normalized into a single character. So the issue here is handling the situation when it cannot be normalized (which I think would happen a lot).

github-actions[bot] commented 5 months ago

:tada: This issue has been resolved in version 5.3.0 :tada:

The release is available on:

Your semantic-release bot :package::rocket:

wickedest / Mergely

Diacritical marks are ignored when differientiating #169