wolverine2k / crunchy

Automatically exported from code.google.com/p/crunchy

UnicodeDecodeError in handle_default.py due to open() converting bytestring back into Unicode using the ascii codec #189

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?
1. I use Python 2.6.1 and Crunchy 1.0 alpha 1 (revision 1044) on Windows XP.
2. I click on the python.org link on the first page.
3. The page goes blank and shows a UnicodeDecodeError.

What is the expected output? What do you see instead?

I expected to browse the Python website and see it displayed in the browser.
Instead I see the following error:
Traceback (most recent call last):
  File "C:\Documents and Settings\Administrator\Επιφάνεια εργασίας\crunchy1.0alpha1\crunchy\src\http_serve.py", line 157, in do_POST
    self.server.get_handler(realpath)(self)
  File "C:\Documents and Settings\Administrator\Επιφάνεια εργασίας\crunchy1.0alpha1\crunchy\src\plugins\handle_default.py", line 84, in handler
    data = path_to_filedata(request.path, root_path, request.crunchy_username)
  File "C:\Documents and Settings\Administrator\Επιφάνεια εργασίας\crunchy1.0alpha1\crunchy\src\plugins\handle_default.py", line 60, in path_to_filedata
    return open(npath.encode(sys.getfilesystemencoding()),
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 40: ordinal not in range(128)

Please use labels and text to provide additional information.
The UnicodeDecodeError probably happens because the encoding is not handled
properly on Windows with the Greek MUI (Multilingual User Interface).
When I type the following in the Python interpreter:
import sys
sys.getfilesystemencoding()
I get the string 'mbcs',
which does not look suitable for encoding the Greek letters,
and I think that is why the path_to_filedata method raises a UnicodeDecodeError.
I think on Windows it would be better to use:
import locale
locale.getdefaultlocale()
which produces the tuple ('el_GR', 'cp1253').
Then index it like this:
locale.getdefaultlocale()[1]
and you get the string 'cp1253'.
So maybe you can replace sys.getfilesystemencoding()
with locale.getdefaultlocale()[1] when you are sure the user
is on Windows.
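
The suggestion above could be sketched roughly as follows. This is only an illustration, not Crunchy's code: `pick_path_encoding` is a hypothetical helper name, the sketch is written for Python 3, and it swaps in `locale.getpreferredencoding(False)` for `locale.getdefaultlocale()[1]`, since recent Python 3 releases deprecate `getdefaultlocale()`.

```python
import codecs
import locale
import sys


def pick_path_encoding(platform=sys.platform):
    """Return the name of the codec to use for byte-string paths.

    Hypothetical helper: on Windows prefer the locale's code page
    (for example 'cp1253' on a Greek system); elsewhere keep the
    filesystem encoding reported by Python.
    """
    if platform.startswith("win"):
        name = locale.getpreferredencoding(False)
    else:
        name = sys.getfilesystemencoding()
    # Fall back to UTF-8 if the reported name is empty or not a known codec.
    try:
        codecs.lookup(name)
    except (LookupError, TypeError):
        name = "utf-8"
    return name


print(pick_path_encoding())
```

Whether this choice would actually avoid the error depends on what kind of string is being encoded, which the report does not establish.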
    Also, why does Crunchy try to connect to www.skyhookwireless.com? Is
that something users of Crunchy should be worried about for their
privacy? Are you aware that programs connecting somewhere without the
explicit permission of the user are generally considered harmful or
even malware?

Original issue reported on code.google.com by gianni...@gmail.com on 20 Jan 2009 at 9:37

GoogleCodeExporter commented 8 years ago
Sorry, I should perhaps have written Label:OpSys-Windows in the previous post.

Original comment by gianni...@gmail.com on 20 Jan 2009 at 9:42

GoogleCodeExporter commented 8 years ago
Thanks for the report. There are indeed problems on Windows when the user's name
or other path components contain non-ascii characters. I can reproduce the issue
here (my normal username on a Windows computer has a non-ascii character in it).

If you want to run Crunchy on Windows, the only option at this point would be to
make sure you install it in a path that only contains ascii characters. If you
could do this (even if only as a test) and report the result, it would be much
appreciated - especially since I have not installed Python 2.6 yet.

As for the connection to skyhookwireless, I have NO idea. It should NEVER connect
anywhere without user intervention. I can certainly understand why you would be
upset seeing this...

How is it that you notice that Crunchy tries to connect to skyhookwireless?

Are you sure you didn't have something else running in your browser at the same
time? Otherwise, the only (far-fetched?) explanation I can think of is that,
somehow, localhost (127.0.0.1) is mapped to this address on your system due to
some malware already present. I obviously can't reproduce the bug here.

Original comment by andre.ro...@gmail.com on 20 Jan 2009 at 10:24

GoogleCodeExporter commented 8 years ago
Hi Andre Roberge,
First of all, I have to admit that I was wrong about Crunchy trying to connect
to a domain. Some time ago I changed my hosts file for security reasons, and I
found that I had mapped 127.0.0.1 to the www.sk*** domain by mistake.
I am really sorry for the misunderstanding.

About Crunchy: yes, I checked whether it works. I moved the crunchy directories
to the c:\ level, and I found another "problem".
More precisely, when I clicked the link to python.org from Crunchy I got the
following error (I tried several times):

404 NOT FOUND: /www.python.org

Auê! (oops!) The page you are looking for (/www.python.org) is no longer here or
has been moved!
That page had this as a title --> it's all lost and stoof

Original comment by gianni...@gmail.com on 22 Jan 2009 at 7:32

GoogleCodeExporter commented 8 years ago
I will enter the 404 as a separate issue; thanks for the bug report.

The UnicodeDecodeError bug remains :-(

Original comment by andre.ro...@gmail.com on 22 Jan 2009 at 11:35

GoogleCodeExporter commented 8 years ago
Adding labels.

Original comment by andre.ro...@gmail.com on 22 Jan 2009 at 11:38

GoogleCodeExporter commented 8 years ago
Hi again! :-) I think the following might help you. :-)
First I started cmd from a path with Greek letters, rarely used characters and
spaces, in order to get a fully "buggy" string.
The original buggy path is:
C:\Documents and Settings\user\Επιφάνεια εργασίας\%τέστ_όϊ(#2011
So I typed the following into the Python interpreter:

import locale, os, sys
# now I check the encodings on the system
print locale.getdefaultlocale()
# nice, we have a tuple with the user's locale :-)
# I get ('el_GR', 'cp1253')
print locale.getpreferredencoding()
# nice, this is the preferred encoding
# I get 'cp1253'
print sys.stdout.encoding
# this is the console's default encoding
# I get 'cp737', which is 100% OK because typing chcp also reports code page 737 :-)
print sys.getfilesystemencoding()
# now the mbcs string
# I get 'mbcs'
os.getcwd()
# I printed the current working directory
# I got

 'C:\\Documents and Settings\\user\\\xc5\xf0\xe9\xf6\xdc\xed\xe5\xe9\xe1 \xe5\xf1\xe3\xe1\xf3\xdf\xe1\xf2\\%\xf4\xdd\xf3\xf4_\xfc\xfa(#2011'

# hmm, it looks inappropriate
# next I tried to decode the "buggy" string,
# which returns a unicode string
# this time using cp737
os.getcwd().decode('cp737')
# I got

u'C:\\Documents and Settings\\user\\\u253c\u038f\u03ce\xf7\u2584\u038a\u03af\u03ce\u03ac \u03af\xb1\u03ae\u03ac\u2264\u2580\u03ac\u2265\\%\u03aa\u258c\u2264\u03aa_\u207f\xb7(#2011'

# then I printed it and I got

C:\Documents and Settings\user\┼Ώώ÷▄Ίίώά ί±ήά≤▀ά≥\%Ϊ▌≤Ϊ_ⁿ·(#2011

# hmm, first attempt rejected!
# a second time, with mbcs
os.getcwd().decode('mbcs')
# and I got this unicode string

u'C:\\Documents and Settings\\user\\\u0395\u03c0\u03b9\u03c6\u03ac\u03bd\u03b5\u03b9\u03b1 \u03b5\u03c1\u03b3\u03b1\u03c3\u03af\u03b1\u03c2\\%\u03c4\u03ad\u03c3\u03c4_\u03cc\u03ca(#2011'

# then I printed this one too and I got
C:\Documents and Settings\user\Επιφάνεια εργασίας\%τέστ_όϊ(#2011
# Bingo! It's OK
# finally I tried the cp1253 encoding to be "sure" :-)
os.getcwd().decode('cp1253')
# and I got this unicode string

u'C:\\Documents and Settings\\user\\\u0395\u03c0\u03b9\u03c6\u03ac\u03bd\u03b5\u03b9\u03b1 \u03b5\u03c1\u03b3\u03b1\u03c3\u03af\u03b1\u03c2\\%\u03c4\u03ad\u03c3\u03c4_\u03cc\u03ca(#2011'

# and I printed that too and got the right result again

C:\Documents and Settings\user\Επιφάνεια εργασίας\%τέστ_όϊ(#2011
# Bingo again
# I see that on my computer the mbcs (or better, MBCS) mapping to Unicode is the
# same as the cp1253 one, at least for Greek letters. I didn't check the CJK
# character sets for East Asian languages, but I am sure they must be treated
# in a similar way.
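
The comparison above can be reproduced without a Greek Windows machine. The following is a minimal Python 3 sketch that takes the bytes of the two Greek words from the `os.getcwd()` output and decodes them with both code pages. The `mbcs` codec is left out because it exists only on Windows; the variable names are made up for the illustration.

```python
# Bytes of the two Greek words, copied from the os.getcwd() output above.
raw = (b"\xc5\xf0\xe9\xf6\xdc\xed\xe5\xe9\xe1 "
       b"\xe5\xf1\xe3\xe1\xf3\xdf\xe1\xf2")

as_cp1253 = raw.decode("cp1253")  # Windows (ANSI) code page for Greek
as_cp737 = raw.decode("cp737")    # console (OEM) code page for Greek

print(as_cp1253)
print(as_cp737)

# Decoding with the matching code page round-trips back to the same bytes.
print(as_cp1253.encode("cp1253") == raw)
```

Both calls succeed, but only the `cp1253` result is readable Greek. That is why a wrong codec can go unnoticed until the text is displayed.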

# About how not to raise a UnicodeDecodeError: I think that using Unicode by
# default internally, then checking the system locale and picking the right
# encoding according to the system's language, should be a good way to get a
# solution.
# This is exactly what I "inherited" from a page at this site:
# http://www.amk.ca/python/howto/unicode
# That page suggests, or maybe even insists:
"""
    Software should only work with Unicode strings internally, converting to a
particular encoding on output.

If you attempt to write processing functions that accept both Unicode and 8-bit
strings, you will find your program vulnerable to bugs wherever you combine the
two different kinds of strings. Python's default encoding is ASCII, so whenever a
character with an ASCII value >127 is in the input data, you'll get a
UnicodeDecodeError because that character can't be handled by the ASCII
encoding. """
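
A minimal sketch of the rule quoted above, written for Python 3. The helper names `to_text` and `join_path`, and the default of `cp1253`, are assumptions made for the example, not part of Crunchy. Byte strings are decoded once at the boundary, all joining happens on text, and encoding is left to the very end.

```python
def to_text(value, encoding="cp1253"):
    """Return `value` as text, decoding byte strings exactly once."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def join_path(*parts, encoding="cp1253"):
    """Join path components with a backslash after converting each to text."""
    return "\\".join(to_text(part, encoding) for part in parts)


# A byte-string component (as an 8-bit API might return it) mixed with text.
desktop = "Επιφάνεια εργασίας".encode("cp1253")
full = join_path("C:", desktop, "crunchy")
print(full)

# Encode only at the end, when a byte-oriented interface really needs bytes.
encoded = full.encode("cp1253")
```

Keeping the conversion in one helper means the two kinds of strings are never combined directly.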

# I didn't solve the UnicodeDecodeError problem, but at least I think I now
# understand it well enough that what I wrote can be helpful
# to the readers, or maybe to the project itself.

Original comment by gianni...@gmail.com on 22 Jan 2009 at 9:17

GoogleCodeExporter commented 8 years ago
Sorry, instead of MBCS I meant to write DBCS (Double Byte Character Set).

Original comment by gianni...@gmail.com on 22 Jan 2009 at 9:23

GoogleCodeExporter commented 8 years ago
Wasn't that helpful? I am curious 

Original comment by gianni...@gmail.com on 24 Jan 2009 at 6:05

GoogleCodeExporter commented 8 years ago
Sorry - I did not have time to investigate; I'm kind of swamped with other stuff
right now.   I will be posting an update here when I have the time.

Original comment by andre.ro...@gmail.com on 24 Jan 2009 at 10:11

GoogleCodeExporter commented 8 years ago
It's OK, I understand.

Original comment by gianni...@gmail.com on 25 Jan 2009 at 12:51

GoogleCodeExporter commented 8 years ago
Hello bug reporter!

If you

(1) Run the Python interpreter; and
(2) Execute open(u"C:/Documents and Settings/Administrator/Επιφάνεια εργασίας/[any file you have lying around]"),

does an exception show up? I'm porting Crunchy to Python 3 and I'm hoping that
Python 2.6 on Windows locales does the right thing when the path is Unicode. If
not, I'll put in the locale workaround that you suggested.
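
For anyone wanting to try the same check without touching a real profile directory, here is a rough Python 3 sketch. It assumes the filesystem and the current locale can represent Greek file names. The directory and file names are made up for the test.

```python
import os
import tempfile

# Create a throwaway directory whose name contains Greek letters and a space.
with tempfile.TemporaryDirectory() as base:
    folder = os.path.join(base, "Επιφάνεια εργασίας")
    os.mkdir(folder)
    target = os.path.join(folder, "test.txt")

    # The path is passed as text; no manual .encode() call is involved.
    with open(target, "w", encoding="utf-8") as handle:
        handle.write("ok")
    with open(target, encoding="utf-8") as handle:
        content = handle.read()
    listed = os.listdir(folder)

print(content, listed)
```

If `open()` raised here, the problem would lie in how Python maps text paths to the filesystem rather than in Crunchy itself.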

File paths are generally a pain since a lot of environments don't have the right
locales set up for Python to figure out what to do (sometimes they are even too
incorrect for the locale module to return the right value). This might eventually
have to be a configuration option where the user enters his locale in as
user-friendly a manner as possible.

Original comment by shadytr...@gmail.com on 10 Jul 2009 at 10:12

GoogleCodeExporter commented 8 years ago
More research on this: On Windows, when open() receives a bytestring, it will try
to convert it into Unicode. open() calls file_init in fileobject.c [1]. With a
bytestring, wideargument is then always 0 and therefore the arguments get parsed
as "et|si:file" or a bytestring that's encoded as Py_FileSystemDefaultEncoding
(sys.getfilesystemencoding()), which is always mbcs on Windows. That's fine; since
we encoded the string as mbcs with getfilesystemencoding, decoding it as mbcs
shouldn't be a problem. But on your machine, the ascii codec somehow gets called
to decode the bytestring, and I'm completely baffled as to how that's happening
since getfilesystemencoding() is returning mbcs for you.

[1]: http://svn.python.org/view/python/trunk/Objects/fileobject.c?revision=73686&view=markup
[2]: http://svn.python.org/view/python/trunk/Python/bltinmodule.c?revision=73776&view=markup
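
One reading that would fit the traceback, though nothing in this thread confirms it: byte 0xc5 at position 40 is exactly where the cp1253-encoded "Ε" of "Επιφάνεια" would sit in the reported path. That would be consistent with npath already being a byte string when .encode() was called on it, because in Python 2 calling .encode() on a str first decodes it implicitly with the default ascii codec. The Python 3 sketch below only emulates that implicit step with an explicit decode("ascii"). `first_undecodable` is a hypothetical helper name.

```python
def first_undecodable(raw, encoding="ascii"):
    """Return (position, byte value) of the first byte `encoding` rejects, or None."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError as exc:
        return exc.start, raw[exc.start]
    return None


# Directory from the traceback, as a byte string in the Greek ANSI code page.
npath = ("C:\\Documents and Settings\\Administrator\\"
         "Επιφάνεια εργασίας").encode("cp1253")

hit = first_undecodable(npath)
print(hit[0], hex(hit[1]))
```

Under that assumption, the failure would happen before open() is ever called, which would also match Crunchy getting past this point once it was moved to an ascii-only path.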

For André's branch and probably the 1.0 release, path_to_filedata is passing
Unicode to open(), which bypasses the encoding altogether since Windows natively
uses Unicode for its filesystem. On other systems, Unicode paths get encoded down
to nl_langinfo(CODESET) [3] as expected. This is the option we took for Python 2,
and it's the only option available for Python 3. In light of this, I'm closing
this bug (as Fixed, since there doesn't seem to be a more nuanced, correct status)
since it's been obsoleted by these changes.

[3]: http://svn.python.org/view/python/trunk/Python/pythonrun.c?revision=71152&view=markup

Original comment by shadytr...@gmail.com on 13 Aug 2009 at 1:02