yingchengsun / IntelBase

Domain information collection and analysis platform

Line-by-line file reading loop stops at the character '\u001a' #3

Open yingchengsun opened 6 years ago

yingchengsun commented 6 years ago

When reading RS_title_text files such as 'RS_2007-10_index-title-text.txt' or 'RS_2009-11_index-title-text.txt' line by line, the program stops as soon as it reaches a line containing the character '\u001a', even though unread records remain after it. This appears to be a control character: https://stackoverflow.com/questions/17024436/what-is-the-unicode-u001a-character-aka-0x1a Can it be skipped or treated as a normal character?
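One possible explanation, assuming the script runs under Python 2 on Windows and opens the file in text mode, is that byte 0x1A (Ctrl-Z) is being treated as an end-of-file marker. The sketch below reads the file through `io.open`, which decodes from a binary stream and so should not stop at that byte. The function name `read_records` and the three-column layout (index, title, text) are assumptions based on the write call quoted in this thread, not code from the repository.

```python
import io


def read_records(path):
    """Yield (index, title, text) tuples from a tab-separated UTF-8 file.

    Hypothetical helper: the three-column layout is assumed from the
    u'%i\t%s\t%s\n' format string used when the files were written.
    """
    # io.open reads bytes and decodes them itself, so U+001A is returned
    # as an ordinary character instead of ending the file early.
    with io.open(path, "r", encoding="utf-8", newline="\n") as infile:
        for line in infile:
            fields = line.rstrip("\r\n").split("\t")
            if len(fields) != 3:
                # Skip malformed rows rather than aborting the whole loop.
                continue
            index, title, text = fields
            yield int(index), title, text
```

With this approach the '\u001a' character stays in the title or text field, so any later step that cares about it still has to handle it.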

yingchengsun commented 6 years ago

I tried replacing it with a normal character such as '-' or a space, and that worked. However, so many records contain this character that replacing it by hand is not practical. Still looking for a better approach.
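The manual replacement could be automated with a small script. This is a minimal sketch, assuming the files are UTF-8 encoded; the function name `replace_sub_bytes` and the default replacement (a space) are illustrative choices. It works on raw bytes, which is safe for UTF-8 because every byte of a multi-byte sequence is 0x80 or higher, so 0x1A can only be the control character itself.

```python
def replace_sub_bytes(src_path, dst_path, replacement=b" "):
    """Copy src_path to dst_path, replacing every 0x1A byte.

    Hypothetical helper. Returns the number of bytes replaced.
    """
    replaced = 0
    # Binary mode on both sides: no end-of-file handling for 0x1A
    # and no newline translation.
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        # Read in 1 MiB chunks so large dump files do not need to fit in memory.
        for chunk in iter(lambda: src.read(1 << 20), b""):
            replaced += chunk.count(b"\x1a")
            dst.write(chunk.replace(b"\x1a", replacement))
    return replaced
```

Writing to a separate output file keeps the original data intact in case the replacement character turns out to be a poor choice.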

yingchengsun commented 6 years ago

This bug may be introduced when preprocessing the raw data:

outfile_subm_title_text.write((u'%i\t%s\t%s\n' % (index, title, text)).encode('utf-8'))

The encode/decode handling may not be appropriate for these control characters.
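If the character is to be dealt with at preprocessing time, one option is to strip it from the fields before each record is written. The sketch below is an assumption about how that could look, not the project's actual code: `format_record` is a hypothetical helper, and only the format string is taken from the line above.

```python
import io


def format_record(index, title, text, replacement=u" "):
    """Build one output row with U+001A replaced in both text fields.

    Hypothetical helper; the row format matches the write call quoted above.
    """
    title = title.replace(u"\u001a", replacement)
    text = text.replace(u"\u001a", replacement)
    return u"%i\t%s\t%s\n" % (index, title, text)


def write_records(path, records):
    """Write (index, title, text) tuples to path as UTF-8."""
    # io.open handles the encoding, so no manual .encode('utf-8') is needed,
    # and newline="\n" prevents newline translation on Windows.
    with io.open(path, "w", encoding="utf-8", newline="\n") as outfile:
        for index, title, text in records:
            outfile.write(format_record(index, title, text))
```

Encoding '\u001a' to UTF-8 is itself valid, so the write call is probably not producing corrupt output; cleaning here simply keeps the character out of the files that are read later.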