Because the input and output text is Chinese, I added an explicit encoding to the open() calls. Without it, I get the following error:
>> Synonyms load wordseg dict [D:\installed\miniconda\lib\site-packages\synonyms\data\vocab.txt] ...
Building prefix dict from D:\installed\miniconda\lib\site-packages\synonyms\data\vocab.txt ...
Loading model from cache C:\Users\LIJIAX~1\AppData\Local\Temp\jieba.u24e2f9dc467017ec363179dba6484c45.cache
Loading model cost 1.352 seconds.
Prefix dict has been built successfully.
>> Synonyms on loading stopwords [D:\installed\miniconda\lib\site-packages\synonyms\data\stopwords.txt] ...
>> Synonyms on loading vectors [D:\installed\miniconda\lib\site-packages\synonyms\data\words.vector] ...
D:\installed\miniconda\lib\site-packages\smart_open\smart_open_lib.py:254: UserWarning: This function is deprecated, use smart_open.open instead. See the
migration notes for details: https://github.com/RaRe-Technologies/smart_open/blob/master/README.rst#migrating-to-the-new-open-function
'See the migration notes for details: %s' % _MIGRATION_NOTES_URL
Traceback (most recent call last):
File "code/augment.py", line 54, in <module>
gen_eda(args.input, output, alpha=alpha, num_aug=num_aug)
File "code/augment.py", line 38, in gen_eda
lines = open(train_orig, 'r').readlines()
UnicodeDecodeError: 'gbk' codec can't decode byte 0x80 in position 24: illegal multibyte sequence
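The traceback shows open() falling back to the platform default codec (gbk on this Windows machine) because no encoding was passed. Below is a minimal sketch of the kind of change meant here, assuming the data files are UTF-8; the actual encoding is not named above. The helpers read_lines, write_lines and round_trip_demo are hypothetical names for illustration, not functions from code/augment.py.

```python
import os
import tempfile


def read_lines(path, encoding="utf-8"):
    """Read all lines, decoding with an explicit encoding (assumed UTF-8)
    instead of the platform default such as gbk."""
    with open(path, "r", encoding=encoding) as f:
        return f.readlines()


def write_lines(path, lines, encoding="utf-8"):
    """Write lines with the same explicit encoding so output matches input."""
    with open(path, "w", encoding=encoding) as f:
        f.writelines(lines)


def round_trip_demo():
    """Write two Chinese sample lines to a temp file and read them back."""
    sample = ["0\t今天天气不错哦。\n", "0\t又是阳光明媚的一天!\n"]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "train.txt")
        write_lines(path, sample)
        return read_lines(path)
```

Passing the encoding on both the read and the write side keeps the result independent of the machine's locale.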
I also found that the proper input is
0 今天天气不错哦。
1 今天天气不行啊!不能出去玩了。
0 又是阳光明媚的一天!
instead of the same entries separated by blank lines:
0 今天天气不错哦。

1 今天天气不行啊!不能出去玩了。

0 又是阳光明媚的一天!
A blank line produces a parts list with no second element, so parts[1] fails with the following error:
Traceback (most recent call last):
File "code/augment.py", line 54, in <module>
gen_eda(args.input, output, alpha=alpha, num_aug=num_aug)
File "code/augment.py", line 44, in gen_eda
sentence = parts[1]
IndexError: list index out of range
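One way to make the loader tolerate blank lines, rather than requiring a file without them, is to skip any line that does not split into a label and a sentence. This is only a sketch: it assumes each record is a label, a tab, and a sentence (the separator is not shown above), and parse_labeled_lines is a hypothetical helper, not part of code/augment.py.

```python
def parse_labeled_lines(lines, sep="\t"):
    """Return (label, sentence) pairs, skipping blank or malformed lines.

    Assumption: each record is "<label><sep><sentence>".
    """
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line would otherwise leave parts without index 1.
            continue
        parts = line.split(sep, 1)
        if len(parts) < 2:
            # No separator found: skip instead of raising IndexError.
            continue
        pairs.append((parts[0], parts[1]))
    return pairs
```

For example, a list holding the three sample records with empty lines between them yields three pairs and no IndexError.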
In short: I fixed the encoding problem, because the input and output here are Chinese rather than English text, and I also found that train.txt must not contain blank lines between entries.