Open yingchengsun opened 6 years ago
I tried replacing it with an ordinary character such as '-' or a space, and that worked. However, so many records contain that odd character that replacing it by hand is not practical. Still looking for another approach.
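For what it's worth, the replacement does not have to be done by hand. Below is a minimal sketch, not the project's own code, assuming the files are UTF-8 and that swapping every control character (except tab, newline and carriage return) for a space is acceptable. The names strip_control_chars and clean_file are hypothetical helpers made up for illustration.

```python
import io
import re

# Control characters other than tab (\x09), newline (\x0a) and
# carriage return (\x0d). This range includes \x1a.
_CONTROL_CHARS = re.compile(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]')


def strip_control_chars(text, replacement=u' '):
    """Return text with stray control characters replaced."""
    return _CONTROL_CHARS.sub(replacement, text)


def clean_file(src_path, dst_path):
    """Copy src_path to dst_path, cleaning each line along the way.

    Both files are opened in binary mode so that no byte gets special
    treatment from the platform's text layer (assumption: UTF-8 data).
    """
    with io.open(src_path, 'rb') as src, io.open(dst_path, 'wb') as dst:
        for raw in src:
            cleaned = strip_control_chars(raw.decode('utf-8'))
            dst.write(cleaned.encode('utf-8'))
```

Tabs are deliberately left alone because the records use them as field separators.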
This bug may be introduced when preprocessing the raw data: outfile_subm_title_text.write((u'%i\t%s\t%s\n' % (index, title, text)).encode('utf-8')) The decode/encode step may not handle these unusual characters appropriately.
When reading RS_title_text files such as 'RS_2007-10_index-title-text.txt' or 'RS_2009-11_index-title-text.txt' line by line, the program stops at the first line containing the character '\u001a', even though unread records remain. This appears to be a control character: https://stackoverflow.com/questions/17024436/what-is-the-unicode-u001a-character-aka-0x1a Can it be skipped, or treated as a normal character?
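One possible explanation, assuming the script runs under Python 2 on Windows: a file opened in text mode treats the byte 0x1A (Ctrl-Z) as an end-of-file marker, so reading stops there. If that is the cause, opening the file in binary mode and decoding each line yourself should let the reader get past it. A rough sketch follows, assuming the index, title, text layout separated by tabs from the write call quoted above. The name iter_records is a hypothetical helper, not something from the repository.

```python
import io


def iter_records(path):
    """Yield (index, title, text) tuples from a tab-separated file.

    The file is opened in binary mode ('rb') so that byte 0x1A is read
    like any other byte instead of being taken as end-of-file.
    Assumption: the file is UTF-8 encoded, one record per line.
    """
    with io.open(path, 'rb') as f:
        for raw in f:
            line = raw.decode('utf-8').rstrip(u'\r\n')
            parts = line.split(u'\t', 2)
            if len(parts) != 3:
                continue  # skip blank or malformed lines
            yield int(parts[0]), parts[1], parts[2]
```

Read this way, '\u001a' stays in the text as an ordinary character; whether to keep it or strip it afterwards is a separate choice.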