In gli_getchar_utf8() in garglk/cgunicod.c, the return value of getc() is
stored into a glui32. It is then checked to see if it is < 0, in an attempt to
detect EOF. However, since glui32 is unsigned, it will never be less than
zero. When EOF is reached, it is not detected as such, but instead (assuming
EOF is -1) is seen as the start of an invalid multibyte sequence.
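To illustrate, the problematic pattern is roughly the following (a paraphrase
with invented names, not the exact code from cgunicod.c):

/* Paraphrase of the buggy pattern. */
static glui32 buggy_read_byte(FILE *fl)
{
    glui32 res = getc(fl);   /* on EOF, -1 wraps to 0xffffffff */
    if (res < 0)             /* always false: glui32 is unsigned */
        return -1;           /* intended EOF path, never taken */
    return res;              /* 0xffffffff falls through as if it were a byte */
}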
Following are three possible fixes that I've come up with.
The first is to change the glui32 values to ints. This assumes that int is at
least 32 bits (well, technically 21 bits), because for a 4-byte sequence the
value is left-shifted by 18 bits; with a 16-bit int, that shift is undefined
behavior. If Gargoyle allows 16-bit ints, this is not a good solution; if it
requires 32-bit ints, it should be fine. The 16-bit int problem could also be
fixed by strategic casting of the ints to glui32: as far as I can tell, this
is only needed where they are left-shifted by 12 or 18, as sketched below.
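For example, the 4-byte case might look like this (a sketch with an invented
helper name, not code from cgunicod.c):

/* Assemble a code point from the four bytes of a UTF-8 sequence that
 * have already been read and range-checked. */
static glui32 assemble_4byte(int c0, int c1, int c2, int c3)
{
    return ((glui32)(c0 & 0x07) << 18)   /* cast needed: shifted by 18 */
         | ((glui32)(c1 & 0x3f) << 12)   /* cast needed: shifted by 12 */
         | ((c2 & 0x3f) << 6)            /* safe even with a 16-bit int */
         | (c3 & 0x3f);
}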
Another solution would be to keep using glui32 but compare explicitly against
EOF. This is imperfect: on a system with 64-bit ints, the glui32 would be
promoted to int for the comparison and the test would fail, and EOF could in
theory be defined as something other than -1, say INT_MIN. Such a system is
almost surely less likely than a 16-bit int, however.
The same problem would arise were glsi32 used.
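A sketch of that second approach (invented name; not the patched code):

/* Detect EOF after the fact, with getc()'s result already in a glui32. */
static int reached_eof(glui32 res)
{
    /* With 32-bit ints, EOF (-1) converts to (glui32)-1 and this works;
     * with 64-bit ints, res would instead be promoted to int and the
     * test would always fail. */
    return res == EOF;
}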
The best general solution, in my opinion, would be to create a wrapper around
getc() that does something like:
static inline glui32 read_byte(FILE *fl)
{
    int c;

    c = getc(fl);
    if (c == EOF)
        return -1;  /* converts to the largest glui32 value, 0xffffffff */
    return c;
}
This way glui32 can continue to be used, and can explicitly be compared to
(glui32)-1 (it's guaranteed that -1 will be converted to the largest value a
glui32 can hold, so there is no undefined or implementation-defined behavior).
One might expect this to run slower due to the extra check, but when I profile
it, I get mixed results: gcc is about 2% faster, clang is the same (it
generates the same code except for one case of jbe vs. jle), and Sun Studio
and Intel's compiler are around 2-5% slower. Note that gcc becomes slower if
the function is not inlined, either explicitly or by using -O3.
I've attached a patch that implements the third solution. In the comparisons I
compare against (glui32)-1 instead of a plain -1, defensively: if int is 64
bits, the comparison would promote the glui32 to an int instead of the other
way around, and a comparison against -1 would fail.
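A call site then looks something like this (illustrative only, not taken from
the patch):

/* Read one byte, reporting end of file to the caller. */
static int next_byte(FILE *fl, glui32 *out)
{
    glui32 res = read_byte(fl);
    if (res == (glui32)-1)   /* both operands are glui32, so the test
                                behaves the same whatever the width of int */
        return 0;            /* end of file */
    *out = res;
    return 1;
}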
Either the first method (with casts) or the last would be portable to all
systems, if I'm not mistaken. I'm not sure which is less ugly...
Original issue reported on code.google.com by cspiegel@gmail.com on 26 Aug 2010 at 6:41