Closed GoogleCodeExporter closed 9 years ago
The mapping that you configure on POS tagger components cannot be used to map
between two fine-grained tagsets (e.g. PTB -> Brown or vice versa). I'll
briefly explain what this mapping is for and then suggest several alternatives.
The DKPro Core type system contains UIMA annotation types representing
coarse-grained tags (very similar to the Universal POS tags
https://code.google.com/p/universal-pos-tags/). The mapping that you can
configure specifies how to map a specific fine-grained tagset used in a corpus
or produced by a tagger to these coarse-grained tags. In your example, you
configure OpenNlpTagger to assume that the model produces tags from the Brown
tagset and use the Brown mapping for the coarse-grained tags. However, the
default OpenNlpTagger model for English produces PTB tags (and also by default
uses the correct PTB->coarse-grained mapping).
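To illustrate why the wrong mapping matters, here is a minimal self-contained sketch. It does not use DKPro Core at all: the two tables are hypothetical, heavily abbreviated stand-ins for the real mapping files, and the fallback to "O" for unknown tags is an assumption of the sketch, not documented behavior.

```java
import java.util.Map;

public class CoarseMappingSketch {
    // Hypothetical, abbreviated fine-to-coarse tables (NOT DKPro Core's mapping files).
    static final Map<String, String> PTB_TO_COARSE = Map.of(
            "VB", "V", "VBD", "V", "NN", "N", "NNP", "N", "DT", "ART");
    static final Map<String, String> BROWN_TO_COARSE = Map.of(
            "VB", "V", "VBD", "V", "NN", "N", "NP", "N", "AT", "ART");

    // Look up the coarse tag; unknown fine tags fall back to "O"
    // (the fallback is an assumption made for this sketch).
    static String coarse(Map<String, String> mapping, String fineTag) {
        return mapping.getOrDefault(fineTag, "O");
    }

    public static void main(String[] args) {
        // A PTB-trained model emits "NNP" for a proper noun.
        String tagFromPtbModel = "NNP";
        System.out.println(coarse(PTB_TO_COARSE, tagFromPtbModel));   // matching mapping
        System.out.println(coarse(BROWN_TO_COARSE, tagFromPtbModel)); // mismatched mapping
    }
}
```

With the matching table the proper noun comes out as "N". With the Brown table the PTB tag "NNP" is not found (Brown uses "NP" for proper nouns), so the sketch falls back to "O".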
For example, to select all verbs based on the coarse-grained UIMA types, you
could use:
    for (POS p : select(jcas, V.class)) {
        System.out.println(p.getCoveredText() + " " + p.getClass().getSimpleName());
    }
To operate on the fine-grained tags, you would use something like:
    for (POS p : select(jcas, POS.class)) {
        if (p.getPosValue().startsWith("V")) {
            System.out.println(p.getCoveredText() + " " + p.getPosValue());
        }
    }
POS tagging in your example doesn't seem to be necessary at all, because the
POS tags are read from the Brown corpus by the TeiReader.
If you wanted to apply some higher-level analysis, e.g. run MaltParser, then
you would need to run a POS tagger, because the MaltParser models for English
are trained on the PTB tagset. In that case, you would configure the TeiReader
not to load the POS tags from the corpus.
Something that might meet your needs is the PosMapper [1] component. PosMapper
lets you rewrite the fine-grained POS tags, e.g. to map the PTB variant
produced by TreeTagger to the standard PTB tagset. If there is a proper
conceptual mapping between the Brown and PTB tagsets, then you could also use
PosMapper to convert from Brown to PTB. The mapping file format should look
like this:
oldtag1=newtag1
oldtag2=newtag2
... and so on
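To show what such a file expresses, here is a minimal sketch in plain Java. It assumes the file can be read as a java.util.Properties file and that tags without an entry are passed through unchanged; both are assumptions of this sketch, not confirmed PosMapper behavior. The class and method names are hypothetical, and the two Brown -> PTB entries are illustrative only, not a vetted mapping.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class TagMappingSketch {
    // Parse "oldtag=newtag" lines using the standard Properties format.
    static Properties loadMapping(String text) {
        Properties mapping = new Properties();
        try {
            mapping.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return mapping;
    }

    // Return the mapped tag, or the original tag if there is no entry
    // (pass-through is an assumption of this sketch).
    static String mapTag(Properties mapping, String tag) {
        return mapping.getProperty(tag, tag);
    }

    public static void main(String[] args) {
        // Illustrative entries only: Brown proper noun and article tags.
        Properties mapping = loadMapping("NP=NNP\nAT=DT\n");
        for (String tag : new String[] {"NP", "AT", "VBD"}) {
            System.out.println(tag + " -> " + mapTag(mapping, tag));
        }
    }
}
```

One caveat if the file really is parsed as a properties file: characters such as ':' and '=' act as key separators in that format, so tags containing them would need escaping.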
[1] http://dkpro-core-asl.googlecode.com/svn/de.tudarmstadt.ukp.dkpro.core-asl/tags/latest-release/apidocs/de/tudarmstadt/ukp/dkpro/core/posfilter/PosMapper.html
Original comment by richard.eckart
on 3 Jul 2014 at 6:11
Original comment by richard.eckart
on 6 Aug 2014 at 8:24
Original issue reported on code.google.com by
onurs3...@googlemail.com
on 2 Jul 2014 at 11:20