UnicodeDecodeError in Python version when using Cyrillic characters.

while0pass / google-diff-match-patch

Automatically exported from code.google.com/p/google-diff-match-patch

Apache License 2.0

I want to use Cyrillic characters with diff_match_patch (Python version, release), but I get errors like: "UnicodeDecodeError: 'utf8' codec can't decode byte 0xd0 in position 0: unexpected end of data". Appending ".decode("utf-8").encode("utf-8")" to strings in some places seems to solve the problems, but I guess not 100%. See the attached patch (and, just in case, the new file). Alexandr.
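
For reference, the quoted message is what Python reports when a UTF-8 decode meets a lead byte with no continuation byte after it. 0xd0 begins many two-byte Cyrillic sequences, so the error suggests a byte string was cut or decoded in the middle of a character. A minimal sketch of that situation, written in Python 3 syntax (the report concerns the Python 2 release) and using a made-up sample string; it does not reproduce the library's own code path:

```python
# Illustration only; nothing here is taken from diff_match_patch.
text = "привет"                  # hypothetical sample input
data = text.encode("utf-8")      # every letter in this sample takes 2 bytes

first_byte = data[:1]            # a lead byte without its continuation byte
try:
    first_byte.decode("utf-8")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)     # same wording as in the report

print(repr(first_byte), error_message)
```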

Thank you Alexandr for the bug report and the patches.  Sorry for the delay.
I have fixed the Unicode issues in diff_fromDelta and patch_fromText.  In both
cases I added:
    if type(text) == unicode:
      text = text.encode("ascii")
These are two functions which are expecting a subset of ASCII characters.
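
A rough sketch of why an ASCII-only input is a reasonable expectation here, assuming (as the %c3%xy example in this thread suggests) that non-ASCII characters reach these two functions percent-encoded. It is written in Python 3 syntax, the helper name is_ascii_only is hypothetical, and urllib.parse.quote stands in for whatever escaping the library performs:

```python
from urllib.parse import quote

def is_ascii_only(text):
    """Hypothetical helper: True if text survives .encode("ascii")."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False

raw = "привет"            # made-up Cyrillic sample
escaped = quote(raw)      # percent-encodes the UTF-8 bytes of each character

print(is_ascii_only(raw), is_ascii_only(escaped), escaped)
```

The raw Cyrillic string cannot be encoded as ASCII, while its percent-encoded form can, which is the property the added encode("ascii") call relies on.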

However, your patch also made changes to diff_text1, diff_text2, patch_apply
and patch_obj.__str__.  Despite many tests, I am unable to find scenarios where
the existing code fails when passed Unicode.  An example testcase would be most
appreciated.

In the meantime, I've pushed out a new version which includes the Unicode fixes
for diff_fromDelta and patch_fromText in the Python version, as well as a new
unit test in all three versions which verifies the behaviour of invalid Unicode
sequences (e.g. %c3%xy).
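
To illustrate what makes %c3%xy invalid: %c3 decodes to a UTF-8 lead byte, but %xy is not a hex escape, so no continuation byte follows. The sketch below uses only the Python 3 standard library (urllib.parse.unquote) and is not the unit test mentioned above; how the library itself reports such input is not shown here.

```python
from urllib.parse import unquote

bad = "%c3%xy"                    # sequence quoted in the comment above

lenient = unquote(bad)            # default errors="replace" substitutes U+FFFD
try:
    unquote(bad, errors="strict") # strict decoding rejects the lone lead byte
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(repr(lenient), strict_failed)
```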

Original comment by neil.fra...@gmail.com on 14 May 2008 at 7:47

Changed state: Fixed

UnicodeDecodeError in Python version when using Cyrillic characters. #9