yanshao9798 / tagger

A Joint Chinese segmentation and POS tagger based on bidirectional GRU-CRF

How to use the trained model to tag sentences? #7

Open GabrielLin opened 6 years ago

GabrielLin commented 6 years ago

Could you please show me how to tag sentences with the trained model? Thanks.

yanshao9798 commented 6 years ago

Hi! I added instructions on how to do that in the readme file. Let me know if you encounter any problems.

GabrielLin commented 6 years ago

When I run

python tagger.py tag -p ud1 -r raw.txt -m model_ud1 -emb Embeddings/glove.txt -opth tagged_file.txt

it shows the following error:

Numbers of sentences: 1. Longest sentence is 267.
Traceback (most recent call last):
  File "tagger.py", line 408, in <module>
    raw_x[k] = toolbox.pad_zeros(raw_x[k], max_step)
  File "/data1/myname/nlp/tagger/toolbox.py", line 826, in pad_zeros
    return [np.pad(item, (0, max_len - len(item)), 'constant', constant_values=0) for item in l]
  File "/opt/anaconda2/envs/tf1p3py27/lib/python2.7/site-packages/numpy/lib/arraypad.py", line 1295, in pad
    pad_width = _validate_lengths(narray, pad_width)
  File "/opt/anaconda2/envs/tf1p3py27/lib/python2.7/site-packages/numpy/lib/arraypad.py", line 1086, in _validate_lengths
    raise ValueError(fmt % (number_elements,))
ValueError: (0, -2) cannot contain negative values.
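The pad width in the traceback is max_len - len(item), so a value of -2 means that one item is two elements longer than max_step. The sketch below reproduces the failure with the expression quoted from toolbox.py. It is written for Python 3. pad_zeros_safe is a hypothetical helper, not part of the repository and not the author's fix. It truncates over-long items, which silences the error but may hide whatever made the lengths disagree in the first place.

```python
import numpy as np


def pad_zeros(l, max_len):
    # Same expression as the line quoted in the traceback (toolbox.py).
    return [np.pad(item, (0, max_len - len(item)), 'constant', constant_values=0)
            for item in l]


def pad_zeros_safe(l, max_len):
    # Hypothetical variant: truncate items longer than max_len so that
    # the pad width can never become negative.
    padded = []
    for item in l:
        item = np.asarray(item)[:max_len]
        padded.append(np.pad(item, (0, max_len - len(item)), 'constant',
                             constant_values=0))
    return padded


if __name__ == "__main__":
    rows = [[1, 2, 3], [4, 5, 6, 7, 8, 9, 10]]  # second row is 2 longer than max_len
    try:
        pad_zeros(rows, 5)
    except ValueError as err:
        print("pad_zeros failed:", err)
    print(pad_zeros_safe(rows, 5))
```

Whether truncation is acceptable depends on why an item ends up longer than max_step, which this thread does not establish.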

yanshao9798 commented 6 years ago

It works fine on my machine. Please check your raw.txt file. Is it one raw sentence per line? Does it only have one sentence?

GabrielLin commented 6 years ago

I found something, but I am not sure about it. The error may be related to English words separated by spaces. For example,

伦敦当地时间10月18日18:00(北京时间19日01:00),AlphaGo Zero再次登上世界顶级科学杂志——《自然》。

causes that error.

But if there are no spaces between the English words, for example

伦敦当地时间10月18日18:00(北京时间19日01:00),AlphaGo再次登上世界顶级科学杂志——《自然》。

it works fine.

yanshao9798 commented 6 years ago

Ok. I'll try to fix this.

yanshao9798 commented 6 years ago

I tested your sentence and it seemed to work fine, but I made some small changes anyway. Please try again and see if it works now.

GabrielLin commented 6 years ago

raw_en.txt

Your response speed is amazing. On my side, the error remains. Could you please try the attached file directly? Thanks.

yanshao9798 commented 6 years ago

Ok. I fixed some minor stuff. Could you try again? Thanks!

GabrielLin commented 6 years ago

Thanks. It does not show any error messages now, but the result could be better. With my model, it separates AlphaGo into 'Alpha' and 'Go', then joins 'Go' with 'Zero' as 'GoZero'. Do you see this as well?

_NUM 伦敦_PROPN 当地_NOUN 时间_NOUN 10_NUM 月_NOUN 18_NUM 日_NOUN 18:00(_NUM 北京_PROPN 时间_NOUN 19_NUM 日_NOUN 01:00),_NUM Alpha_X GoZero_X 再次_ADV 登上_VERB 世界_NOUN 顶级_ADJ 科学_NOUN 杂志_NOUN ——_NUM 《_PUNCT 自然_NOUN 》_PUNCT 。_PUNCT
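The output above appears to be space-separated word_TAG tokens. Here is a small sketch for reading such a line back into (word, tag) pairs, assuming the tag is everything after the last underscore. parse_tagged is a hypothetical helper written for Python 3, not part of the tagger.

```python
def parse_tagged(line):
    """Split a line of space-separated word_TAG tokens into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # Split on the last underscore so words containing '_' stay intact.
        word, sep, tag = token.rpartition('_')
        if not sep:
            # No underscore at all: keep the token as a word without a tag.
            word, tag = token, None
        pairs.append((word, tag))
    return pairs


if __name__ == "__main__":
    sample = "Alpha_X GoZero_X 再次_ADV 登上_VERB"
    for word, tag in parse_tagged(sample):
        print(word, tag)
```

With this reading, the leading _NUM token in the output above parses as an empty word tagged NUM.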

yanshao9798 commented 6 years ago

Yes, because the tagger is not clever enough to make use of the space information. I may fix that later.
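One way to bring the space information back is a post-processing step that re-splits any predicted word spanning a space in the raw input. The sketch below is only an illustration for Python 3, not the author's planned fix. split_at_spaces is a hypothetical helper, and it assumes that the predicted words, concatenated, equal the raw sentence with its whitespace removed.

```python
def split_at_spaces(raw, words):
    """Re-split predicted words so that none spans a space in `raw`."""
    # Offsets in the space-free text at which the raw input had whitespace.
    boundaries = set()
    offset = 0
    for ch in raw:
        if ch.isspace():
            boundaries.add(offset)
        else:
            offset += 1

    result = []
    pos = 0  # offset of the current word in the space-free text
    for word in words:
        start = 0
        for i in range(1, len(word)):
            if pos + i in boundaries:
                # A raw space falls inside this word: cut it here.
                result.append(word[start:i])
                start = i
        result.append(word[start:])
        pos += len(word)
    return result


if __name__ == "__main__":
    raw = "AlphaGo Zero再次登上"
    predicted = ["Alpha", "GoZero", "再次", "登上"]
    print(split_at_spaces(raw, predicted))
```

This only repairs segmentation across spaces. It does not merge 'Alpha' and 'Go' back together, and the pieces it produces would still need POS tags assigned.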