yzhangcs / parser

:rocket: State-of-the-art parsers for natural language.
https://parser.yzhang.site/
MIT License

Problem encountered when training a new model #136

Closed Hairmore closed 4 months ago

Hairmore commented 7 months ago

I ran the following command:

python -u -m supar.cmds.dep.biaffine train -b -d 0 -c dep-biaffine-xlmr -p model --train train.conllx \
    --dev dev.conllx \
    --test test.conllx \
    -encoder=bert \
    --bert=xlm-roberta-large \
    --lr=5e-5 \
    --lr-rate=20 \
    --batch-size=500 \
    --epoch=5 \
    --update-steps=4

My data was originally in CoNLL-U format; I simply changed the file extension to .conllx. When running the command above, I got this error:

File "supar\models\dep\biaffine\transform.py", line 422, in load
    for line in lines:
UnicodeDecodeError: 'gbk' codec can't decode byte 0x94 in position 39: illegal multibyte sequence

The error disappeared after I renamed the file from train.conllx to train.conllx.txt. The run then went through "Building the fields" and "Building the model":

[2023-12-14 19:19:58 INFO] BiaffineDependencyModel(
  (encoder): TransformerEmbedding(xlm-roberta-large, n_layers=4, n_out=1024, stride=256, pooling=mean, pad_index=1, finetune=True)
  (encoder_dropout): Dropout(p=0.1, inplace=False)
  (arc_mlp_d): MLP(n_in=1024, n_out=500, dropout=0.33)
  (arc_mlp_h): MLP(n_in=1024, n_out=500, dropout=0.33)
  (rel_mlp_d): MLP(n_in=1024, n_out=100, dropout=0.33)
  (rel_mlp_h): MLP(n_in=1024, n_out=100, dropout=0.33)
  (arc_attn): Biaffine(n_in=500, bias_x=True)
  (rel_attn): Biaffine(n_in=100, n_out=2, bias_x=True, bias_y=True)
  (criterion): CrossEntropyLoss()
)

But it then failed at the "Caching the data" step:

[screenshot: 捕获2]

I'm not sure whether this is a file-format problem. Could I ask for a copy of your training data to test with? My data format looks like this:

[screenshot: 捕获]

Thank you very much!

yzhangcs commented 6 months ago

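@Hairmore Hello, sorry for the late reply. Please use UTF-8 encoding for .conllx files wherever possible. The .txt extension has a special meaning: it denotes a plain-text file.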
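The advice to keep .conllx files in UTF-8 can be checked before training. The following is a minimal sketch, not part of supar: the helper names `is_valid_utf8` and `convert_to_utf8` are hypothetical, and the `gbk` source encoding is only an assumption for files saved by a Chinese-locale Windows editor.

```python
from pathlib import Path


def is_valid_utf8(path):
    """Return True if the whole file decodes as UTF-8 (a leading BOM is tolerated)."""
    data = Path(path).read_bytes()
    try:
        data.decode("utf-8-sig")
    except UnicodeDecodeError:
        return False
    return True


def convert_to_utf8(src, dst, src_encoding="gbk"):
    """Re-encode src into UTF-8 and write the result to dst.

    src_encoding is an assumption: pass whatever encoding the file was
    actually saved in.
    """
    text = Path(src).read_text(encoding=src_encoding)
    Path(dst).write_text(text, encoding="utf-8")
```

If `is_valid_utf8("train.conllx")` already returns True, the file itself is probably fine and the decoding error is more likely to come from how the file is being read than from how it was saved.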

Hairmore commented 6 months ago

It's fine, no need to apologize. I'm very grateful for your work and help! I have found the cause of this problem: I was training under Windows. After I switched to Linux, the problem disappeared. Thanks a lot!
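One plausible explanation for the Windows/Linux difference, not confirmed anywhere in this thread, is that the data file was opened without an explicit `encoding=` argument. In that case Python falls back to the platform's preferred encoding, which is typically cp936 (reported by Python as `gbk`) on a Simplified Chinese Windows install and UTF-8 on most Linux systems. The sketch below shows how to inspect that default and how to read a file as UTF-8 regardless of it; `default_text_encoding` and `read_lines` are hypothetical helper names.

```python
import locale


def default_text_encoding():
    """Encoding that open() falls back to when no encoding= argument is given."""
    return locale.getpreferredencoding(False)


def read_lines(path):
    """Read a data file as UTF-8 no matter what the platform default is."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


if __name__ == "__main__":
    print("default text encoding:", default_text_encoding())
```

On Python 3.7 or later, setting the environment variable `PYTHONUTF8=1` (or running `python -X utf8`) enables UTF-8 mode, which makes UTF-8 the default for `open()` on Windows as well. Whether that removes the error reported in this issue is untested here.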

Hairmore commented 6 months ago

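> @Hairmore Hello, sorry for the late reply. Please use UTF-8 encoding for .conllx files wherever possible. The .txt extension has a special meaning: it denotes a plain-text file.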

Oh, the ".txt" suffix was a workaround for another weird problem. Under Windows, if the file is named .conllu, that problem appears, but after appending .txt to the name it goes away. I still don't know why.

Sorry for using English; I don't have a Chinese input method set up on my Ubuntu machine yet.

yzhangcs commented 6 months ago

I recommend using CoNLL-U format files with a .conllu/.conllx extension on Linux, which is what I do myself.
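For readers unsure whether their data really is in CoNLL-U format, a quick structural check can help. In CoNLL-U, each token line has 10 tab-separated fields, sentences are separated by blank lines, and comment lines start with `#`. The sketch below tests only the field count; `find_malformed_lines` is a hypothetical helper, not a supar function, and it does not validate the contents of each field.

```python
def find_malformed_lines(lines, n_fields=10):
    """Return (line_number, field_count) for token lines without n_fields tab-separated fields."""
    bad = []
    for line_no, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        # Blank lines separate sentences; '#' lines are CoNLL-U comments.
        if not line or line.startswith("#"):
            continue
        count = len(line.split("\t"))
        if count != n_fields:
            bad.append((line_no, count))
    return bad


if __name__ == "__main__":
    # train.conllu is the file name used in this thread; adjust as needed.
    with open("train.conllu", encoding="utf-8") as f:
        for line_no, count in find_malformed_lines(f):
            print(f"line {line_no}: expected 10 fields, found {count}")
```

A common cause of a wrong field count is a file whose columns are separated by spaces instead of tabs.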

Hairmore commented 6 months ago

> I recommend using CoNLL-U format files with a .conllu/.conllx extension on Linux, which is what I do myself.

Yes, under Linux with .conllu, everything went smoothly.

github-actions[bot] commented 5 months ago

This issue is stale because it has been open for 30 days with no activity.

github-actions[bot] commented 4 months ago

This issue was closed because it has been inactive for 7 days since being marked as stale.