GoogleCodeExporter closed 9 years ago
Original comment by richard.eckart
on 25 Jun 2013 at 10:46
Original comment by richard.eckart
on 25 Jun 2013 at 10:56
Hi,
this sounds very useful and important. Could such a type be used for tagging
text with UBY-"tags"?
E.g., with the "TagSet" type version, that would be something like "name"=
ubySemanticTag and "layer" = semantics
Best
Judith
Original comment by eckle.kohler
on 26 Jun 2013 at 10:20
What I suggest is to preserve information about the tagset, its name, layer,
and its tags. This is meta information which doesn't actually tag anything.
For example, consider running the OpenNLP tagger with a model for German. It
creates annotations of the type POS (or subtypes) which carry a feature
"posValue". What actual values can "posValue" assume and from which inventory
do they come? To record this information, one single "TagSet" feature structure
(not annotation) could be added to the CAS such as:
TagSet {
name: "STTS"
layer: "de.tudarmstadt.ukp.dkpro.core.api.lexmorph.POS"
tags: { "NNP", "NN", "ADJ", ... }
}
None of this information refers directly to the text. The text, however, is
annotated with the POS annotations, e.g.
de.tudarmstadt.ukp.dkpro.core.api.lexmorph.POS {
begin: 10
end: 15
posValue: "NNP"
}
What does "NNP" mean? I could look up the TagSet feature structures, search for
the one that applies to the POS layer
(de.tudarmstadt.ukp.dkpro.core.api.lexmorph.POS or any subtypes) and see that
"NNP" belongs to the "STTS" tagset. One could also imagine adding a link to some
external normative resource, e.g. ISOCat. For the moment, that's beyond my
use-case, though.
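To make the lookup concrete, here is a minimal sketch in plain Python. It does not use the UIMA or DKPro Core APIs; `TagSetInfo` and `find_tagset` are hypothetical names made up for illustration, and subtype matching on the layer is left out for brevity.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional


@dataclass(frozen=True)
class TagSetInfo:
    """Hypothetical stand-in for the proposed TagSet feature structure."""
    name: str
    layer: str
    tags: FrozenSet[str]


def find_tagset(tagsets: Iterable[TagSetInfo], layer: str, tag: str) -> Optional[TagSetInfo]:
    """Return the first tagset recorded for `layer` that lists `tag`, or None."""
    for tagset in tagsets:
        # Exact layer match only; a real implementation would also accept subtypes.
        if tagset.layer == layer and tag in tagset.tags:
            return tagset
    return None


POS_LAYER = "de.tudarmstadt.ukp.dkpro.core.api.lexmorph.POS"

# Meta information only: nothing here points at a span of the text.
tagsets = [
    TagSetInfo(name="STTS", layer=POS_LAYER, tags=frozenset({"NNP", "NN", "ADJ"})),
]

match = find_tagset(tagsets, POS_LAYER, "NNP")
print(match.name if match else "unknown tag")
```

The POS annotation itself stays unchanged; the tagset description is only consulted when someone needs to know where a value such as "NNP" comes from.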
I imagine this can be used for components such as POS taggers or parsers, but
e.g. not for lemmatizers or stemmers, because these do not have the notion of a
closed controlled vocabulary. Or if they have, the vocabulary may be very large
and it would be inconvenient to fully record it in the CAS.
If there were an annotator that used Uby to create annotations, I could
imagine that this annotator could also add TagSet information to the CAS,
informing the user/downstream components which controlled vocabulary the "tags"
come from. I fear, though, that a user of Uby may be looking for either
something far more sophisticated than what I suggest here, e.g. recording full
lexical entries in the CAS, or something simpler, e.g. recording a link to a Uby
lexical entry in the CAS.
Original comment by richard.eckart
on 26 Jun 2013 at 10:37
>> I fear, though, that a user of Uby may be looking for either something far
more sophisticated than what I suggest here, e.g. recording full lexical entries
in the CAS, or something simpler, e.g. recording a link to a Uby lexical entry in
the CAS.
In many applications, a user might not be interested in such complex
information from Uby. So your new type might be actually useful for semantic
tagging with Uby.
We should discuss it F2F, because I aggregated some more ideas on that.
Original comment by eckle.kohler
on 26 Jun 2013 at 10:44
My primary use case right now would be to write this tagset information in a
writer.
In the TcfWriter from WebAnno, the tagset names are currently hard-coded, which
is bad. I would like to avoid having to add parameters for the tagset names and
instead read them from the CAS.
The tagset information could also be used by other writers. E.g., the Negra
export format supports tagset definitions. We do not have a NegraExportWriter
yet. We do have a NegraExportReader, though, which could actually read tagset
information from Negra files and record it in the CAS.
Another conceivable use-case would be to have components validate their
compatibility at runtime. We noted that the Penn Tagset used by the TreeTagger
model for English is not the same as the one expected by the Stanford parser.
This caused problems when we used the TreeTagger to create POS tags and then
used the StanfordParser only to create the constituency structure, based on
the TreeTagger POS tags. If the TreeTagger component recorded the tagset in the
CAS, the StanfordParser could look at this information and issue a warning or
error if the tagset does not correspond to the ones expected by the parser.
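A rough sketch of such a runtime check, again in plain Python rather than against any real component API. The function name `tagset_mismatch` and the tagset names "ptb" and "ptb-treetagger" are assumptions made up for this example; they are not identifiers used by TreeTagger or the Stanford parser.

```python
from typing import Optional, Set


def tagset_mismatch(recorded: str, expected: Set[str], component: str = "component") -> Optional[str]:
    """Return a diagnostic message if `recorded` is not an expected tagset, else None.

    The caller decides whether the message becomes a warning or an error.
    """
    if recorded in expected:
        return None
    return (
        f"{component}: POS tags use tagset '{recorded}', "
        f"but one of {sorted(expected)} is expected"
    )


# Hypothetical names: the upstream tagger recorded "ptb-treetagger",
# while the downstream parser model was trained on "ptb".
problem = tagset_mismatch("ptb-treetagger", {"ptb"}, component="StanfordParser")
if problem is not None:
    print("WARNING:", problem)
```

Returning a message instead of raising leaves the choice between warning and error to the downstream component, which matches the two options mentioned above.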
Original comment by richard.eckart
on 26 Jun 2013 at 10:53
Original comment by richard.eckart
on 27 Jun 2013 at 4:31
Original comment by richard.eckart
on 4 Aug 2013 at 9:14
Original issue reported on code.google.com by
richard.eckart
on 25 Jun 2013 at 10:45