zhanlaoban / EDA_NLP_for_Chinese

An implementation of the EDA paper for Chinese corpora. An EDA data augmentation tool for Chinese corpora. NLP data augmentation. Paper reading notes.
1.35k stars 241 forks

Fix encoding problem. #9

Open JiaxiangBU opened 4 years ago

JiaxiangBU commented 4 years ago

Since the input and output text is in Chinese, I added an explicit encoding to the open function calls. Without it, I get this error:

>> Synonyms load wordseg dict [D:\installed\miniconda\lib\site-packages\synonyms\data\vocab.txt] ...
Building prefix dict from D:\installed\miniconda\lib\site-packages\synonyms\data\vocab.txt ...
Loading model from cache C:\Users\LIJIAX~1\AppData\Local\Temp\jieba.u24e2f9dc467017ec363179dba6484c45.cache
Loading model cost 1.352 seconds.
Prefix dict has been built successfully.
>> Synonyms on loading stopwords [D:\installed\miniconda\lib\site-packages\synonyms\data\stopwords.txt] ...
>> Synonyms on loading vectors [D:\installed\miniconda\lib\site-packages\synonyms\data\words.vector] ...
D:\installed\miniconda\lib\site-packages\smart_open\smart_open_lib.py:254: UserWarning: This function is deprecated, use smart_open.open instead. See the
migration notes for details: https://github.com/RaRe-Technologies/smart_open/blob/master/README.rst#migrating-to-the-new-open-function
  'See the migration notes for details: %s' % _MIGRATION_NOTES_URL
Traceback (most recent call last):
  File "code/augment.py", line 54, in <module>
    gen_eda(args.input, output, alpha=alpha, num_aug=num_aug)
  File "code/augment.py", line 38, in gen_eda
    lines = open(train_orig, 'r').readlines()
UnicodeDecodeError: 'gbk' codec can't decode byte 0x80 in position 24: illegal multibyte sequence
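
The traceback shows open() falling back to the platform default codec (gbk on this Windows machine). Below is a minimal sketch of the kind of change described, assuming the corpus files are UTF-8 encoded (the thread does not state which encoding was chosen). The helper names read_lines and write_lines are hypothetical and are not functions from the repository.

```python
def read_lines(path, encoding="utf-8"):
    """Read all lines with an explicit encoding instead of the platform default."""
    # Assumption: the input file is UTF-8. Without the encoding argument,
    # open() uses the locale's preferred encoding, which is gbk in the log above.
    with open(path, "r", encoding=encoding) as f:
        return f.readlines()


def write_lines(path, lines, encoding="utf-8"):
    """Write lines with the same explicit encoding so output stays readable."""
    with open(path, "w", encoding=encoding) as f:
        f.writelines(lines)
```

Passing the encoding on both the read and the write side keeps the augmented output in the same encoding as the input, regardless of the machine's locale.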

I also found that the proper input format is

0   今天天气不错哦。
1   今天天气不行啊!不能出去玩了。
0   又是阳光明媚的一天!

instead of

0   今天天气不错哦。

1   今天天气不行啊!不能出去玩了。

0   又是阳光明媚的一天!

The blank lines leave parts without a second element, so parts[1] fails with the following error:

Traceback (most recent call last):
  File "code/augment.py", line 54, in <module>
    gen_eda(args.input, output, alpha=alpha, num_aug=num_aug)
  File "code/augment.py", line 44, in gen_eda
    sentence = parts[1]
IndexError: list index out of range
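
One way to make the parsing tolerant of blank lines is sketched below. It assumes each record is a label and a sentence separated by a tab, which is an assumption based on the sample input above. parse_labeled_lines is a hypothetical helper, not code from the repository.

```python
def parse_labeled_lines(lines, sep="\t"):
    """Return (label, sentence) pairs, skipping blank or malformed lines."""
    pairs = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            # A blank line would otherwise split into a single empty field,
            # and indexing the second field would raise IndexError.
            continue
        parts = line.split(sep, 1)
        if len(parts) < 2:
            # No separator found: skip rather than crash (a design choice;
            # raising a descriptive error would also be reasonable).
            continue
        pairs.append((parts[0], parts[1]))
    return pairs
```

With this guard, both input layouts shown above yield the same three records.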

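I fixed the encoding problem, because the input and output here are both Chinese, that is, non-English text. I also found that train.txt must not contain blank lines between records.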