zhh007 / ude

Automatically exported from code.google.com/p/ude

pureascii detection issue #3

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Create a text file containing only the character "3".
2. Save it and run detection.
3. Notice that detection fails.

What is the expected output? What do you see instead?
Expected it to report the file as ASCII. (This happens on any file that has the 
number 3 in it.)

What version of the product are you using? On what operating system?
Latest version, on Windows XP.

Please provide any additional information below.

I noticed that the code checking for EscASCII characters looks for 0x33 
instead of 0x1b. 0x33 is the digit 3, not an escape character.
I'm not sure whether the same issue exists anywhere else in the code.

Original issue reported on code.google.com by rbhatt%c...@gtempaccount.com on 2 Dec 2009 at 5:29

GoogleCodeExporter commented 9 years ago
ESC is '\033' (that's OCTAL) 3 * 8 + 3 == 29
or '\x1B' ... 1 * 16 + 11 == 29

Original comment by sjmac...@lexicon.net on 31 Mar 2010 at 1:35

GoogleCodeExporter commented 9 years ago
@sjmac...@lexicon.net
Your comments are a bit unclear to me. First of all, octal 33 = 27 decimal
(3*8 + 3 = 27, and 1*16 + 11 = 27 as well, not 29).
Second, is this a bug or not?
In my own tests I also see that it goes wrong with the number 3.

I have a file that contains only the number 3, and detection fails. With the 
number 4 it works.

So there is something wrong. The 'non-ascii' value is detected in this code
(UniversalDetector.cs:151-155):
    if (inputState == InputState.PureASCII &&
        (buf[i] == 0x33 || (buf[i] == 0x7B && lastChar == 0x7E))) {
        // found escape character or HZ "~{"
        inputState = InputState.EscASCII;
    }

The buffer is compared against hex 33, not octal 33. The character '3' is 
51 decimal, which is hex 33.
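
To double-check the arithmetic, here is a small stand-alone C# sketch. It is not part of ude; the class and field names are made up for illustration. C# has no octal literal syntax, so octal 33 is parsed from a string with `Convert.ToInt32`.

```csharp
using System;

static class EscapeValues
{
    // Octal 33, parsed at run time because C# lacks octal literals.
    public static readonly int OctalThirtyThree = Convert.ToInt32("33", 8);

    // The ASCII escape character (ESC).
    public const int HexOneB = 0x1B;

    // The value the buggy comparison uses.
    public const int HexThirtyThree = 0x33;

    static void Main()
    {
        Console.WriteLine(OctalThirtyThree); // 27
        Console.WriteLine(HexOneB);          // 27
        Console.WriteLine(HexThirtyThree);   // 51
        Console.WriteLine((int)'3');         // 51
    }
}
```

So octal 33 and hex 1B are the same value (ESC), while hex 33 is the digit '3'.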

So it should be buf[i] == 0x1B.

Bottom line: the code should be changed to:

    if (inputState == InputState.PureASCII &&
        (buf[i] == 0x1B || (buf[i] == 0x7B && lastChar == 0x7E))) {
        // found escape character (hex 1B) or HZ "~{"
        // JV: fix. Was buf[i] == 0x33, which is number 3
        inputState = InputState.EscASCII;
    }
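
For anyone who wants to try the corrected condition in isolation, here is a minimal sketch. The `Scan` helper and the two-value `InputState` enum are assumptions made for this example; the real UniversalDetector does more work inside its loop, and only the `if` condition is taken from the snippet above.

```csharp
using System;

// Simplified stand-in for the detector's input states (assumption).
enum InputState { PureASCII, EscASCII }

static class EscapeScanSketch
{
    // Hypothetical helper: returns EscASCII if the buffer contains
    // ESC (0x1B) or the HZ sequence "~{" (0x7E followed by 0x7B).
    public static InputState Scan(byte[] buf)
    {
        InputState inputState = InputState.PureASCII;
        byte lastChar = 0x00;

        for (int i = 0; i < buf.Length; i++)
        {
            if (inputState == InputState.PureASCII &&
                (buf[i] == 0x1B || (buf[i] == 0x7B && lastChar == 0x7E)))
            {
                inputState = InputState.EscASCII;
            }
            lastChar = buf[i];
        }
        return inputState;
    }

    static void Main()
    {
        Console.WriteLine(Scan(new byte[] { 0x33 }));       // PureASCII: a file with just "3"
        Console.WriteLine(Scan(new byte[] { 0x1B, 0x24 })); // EscASCII: starts with ESC
        Console.WriteLine(Scan(new byte[] { 0x7E, 0x7B })); // EscASCII: HZ "~{"
    }
}
```

With 0x33 in place of 0x1B, the first call would report EscASCII, which matches the failure described in the original report.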

Original comment by j.verdur...@2bmore.nl on 21 Apr 2010 at 11:02

GoogleCodeExporter commented 9 years ago
Yes, it is clearly a typo. The escape character is 0x1B. I'll patch it as soon 
as I'm up and running, hopefully next week.

Thanks!

Original comment by rudi.pet...@gmail.com on 22 Apr 2010 at 5:26

GoogleCodeExporter commented 9 years ago
Committed. Thanks.

Original comment by rudi.pet...@gmail.com on 14 May 2010 at 11:18