Closed JingyaXun closed 4 years ago
Did you correctly create all the fields, e.g., WORD, FEAT?
As a preprocessing step, you must execute the block here.
I presume all fields are created correctly, as I did not make any changes to the cloned code. I made sure that "preprocess" was set to True when I ran the script.
Now the question might be the data format. Can I have a look at some sample instances?
Indeed, the Sentence object does not have the attribute words, but it will be registered in Corpus.load. You can check this yourself.
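To illustrate what "registered at load time" can mean, here is a minimal sketch. It is not the repository's code: the Sentence class, the load_sentence helper, and the field names below are hypothetical stand-ins, and the only assumption is that the loader attaches one attribute per field with setattr.

```python
class Sentence:
    """Hypothetical stand-in: attributes are attached when the data is loaded."""

    def __init__(self, field_names, columns):
        # One attribute per (field name, column) pair, e.g. sentence.words.
        for name, column in zip(field_names, columns):
            setattr(self, name, column)


def load_sentence(rows, field_names):
    """Hypothetical loader: transpose token rows into columns, then attach them."""
    columns = list(zip(*rows))
    return Sentence(field_names, columns)


# Field names are illustrative only.
FIELD_NAMES = ["ids", "words", "lemmas", "tags"]

rows = [
    ["1", "Show", "show", "VERB"],
    ["2", "All", "all", "PRON"],
]
sentence = load_sentence(rows, FIELD_NAMES)
print(hasattr(sentence, "words"))  # True
print(sentence.words)              # ('Show', 'All')

# In this sketch, a row with fewer columns than the others truncates the
# transposed result, so later attributes such as "words" are never set.
short_rows = [["only-one-column"]] + rows
broken = load_sentence(short_rows, FIELD_NAMES)
print(hasattr(broken, "words"))    # False
```

Under this assumed design, the class definition itself never mentions words, so whether the attribute exists depends entirely on what the loader read from the file.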
The data I used is Universal Dependencies 2.5. It was downloaded from here.
Here is a snapshot of the training data:
head -n 50 train.conllu
# newdoc docid = 1
# sent_id = en_lines-ud-train-doc1-1
# text = Show All
1 Show show VERB IMP Mood=Imp|VerbForm=Fin 0 root _ _
2 All all PRON TOT-PL Case=Nom 1 obj _ _
# sent_id = en_lines-ud-train-doc1-2
# text = About ANSI SQL query mode
1 About about ADP _ _ 5 case _ _
2 ANSI ANSI PROPN SG-NOM Number=Sing 5 compound _ _
3 SQL SQL PROPN SG-NOM Number=Sing 2 flat _ _
4 query query NOUN SG-NOM Number=Sing 5 compound _ _
5 mode mode NOUN _ Number=Sing 0 root _ _
# sent_id = en_lines-ud-train-doc1-3
# text = Some of the content in this topic may not be applicable to some languages.
1 Some some PRON IND Case=Nom 11 nsubj _ _
2 of of ADP _ _ 4 case _ _
3 the the DET DEF Definite=Def|PronType=Art 4 det _ _
4 content content NOUN SG-NOM Number=Sing 1 nmod _ _
5 in in ADP _ _ 7 case _ _
6 this this DET DEM-SG Number=Sing|PronType=Dem 7 det _ _
7 topic topic NOUN SG-NOM Number=Sing 4 nmod _ _
8 may may AUX PRES-AUX VerbForm=Fin 11 aux _ _
9 not not PART NEG _ 11 advmod _ _
10 be be AUX INF VerbForm=Inf 11 cop _ _
11 applicable applicable ADJ POS Degree=Pos 0 root _ _
12 to to ADP _ _ 14 case _ _
Remove the comment lines.
Some lines starting with irregular indices like [0-9]-[0-9] should also be handled.
Furthermore, in French and Russian (and perhaps other languages as well) treebanks, the spaces in words (like 100 000) are not cleared, which is also not compatible with my code.
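The three clean-up steps above could be sketched as a small preprocessing script. This is not part of the repository: clean_conllu is a hypothetical helper, it assumes the standard tab-separated CoNLL-U layout (ID, FORM, LEMMA, ...), and it assumes that deleting the spaces inside FORM and LEMMA is acceptable (replacing them with another character would work the same way).

```python
import re

# Plain integer IDs only. Ranges such as "1-2" and decimal IDs such as "1.1"
# do not match and are dropped.
PLAIN_ID = re.compile(r"^\d+$")


def clean_conllu(lines):
    """Yield CoNLL-U lines without comments, irregular-ID rows, or in-word spaces."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            # Blank lines separate sentences, so they are kept.
            yield ""
            continue
        if line.startswith("#"):
            # Comment lines such as "# sent_id = ..." or "# text = ...".
            continue
        cols = line.split("\t")
        if not PLAIN_ID.match(cols[0]):
            continue
        # Assumption: spaces inside FORM and LEMMA (e.g. "100 000") are deleted.
        for i in (1, 2):
            if i < len(cols):
                cols[i] = cols[i].replace(" ", "")
        yield "\t".join(cols)


sample = [
    "# text = du 100 000\n",
    "1-2\tdu\t_\t_\n",
    "1\tde\tde\tADP\n",
    "2\t100 000\t100 000\tNUM\n",
    "\n",
]
cleaned = list(clean_conllu(sample))
print(cleaned)
```

To clean a whole file, the same generator can be fed an open file handle and its output written line by line to a new file, which is then passed to the training script.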
Removing the comment lines does solve my issue. Thank you so much for your help.
:)
I ran into the following error when initiating training:
After some investigation, it seems the error is caused by getattr() in corpus.py.
When I print out name, I get "words", which is set on line 25 of cmd.py. However, self.sentences is a list of Sentence objects, and by definition a Sentence object has no attribute "words", which is the cause of the AttributeError.
Any idea how I can fix this?