PennTreeBank Reader for tagged corpora

GoogleCodeExporter commented 9 years ago

DKPro does not yet have a reader for the tagged plain-text corpora that come 
with the PTB.

Points for discussion:
- The corpora contain noun phrase annotations (in addition to the tags). Is 
there a type to annotate noun phrases in DKPro?

- Tokens occasionally have two or more possible part-of-speech tags in case of 
ambiguity. How should we deal with those? Take only the first one?

- The Switchboard corpus in the PTB additionally marks wrongly tagged words. 
How should we deal with those? Is there a 'no-tag' attribute value for a UIMA POS type?

Original issue reported on code.google.com by Tobias.H...@gmail.com on 1 Aug 2014 at 11:12

GoogleCodeExporter commented 9 years ago

Do you have any idea what this format is officially called? I know the "Penn 
Treebank" format only as the bracketed structure.

Original comment by richard.eckart on 1 Aug 2014 at 11:34

GoogleCodeExporter commented 9 years ago

>> - corpora contain noun phrase annotations (in addition to the tags), is 
there a type to annotate noun phrases in DKPro?

one possibility is to use the type Constituent and set constituentType to 
nounPhrase

Judith

Original comment by eckle.kohler on 1 Aug 2014 at 11:40

GoogleCodeExporter commented 9 years ago

@noun phrases: I'd suggest using "Chunk" annotations

@multiple POS tags: I'd take only the first one

@no-tag: there is no 'no tag' value, but I think there is a . You could simply 
not create a POS annotation for those tokens. That might cause problems with 
downstream components that expect all tokens to have a POS. You could 
consider running a POS tagger that accepts partially pre-tagged input to fill in 
the missing tags. In principle, TreeTagger could do that, but I believe the DKPro Core 
TreeTagger component does handle partially pre-tagged documents.
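
For the "take only the first one" option, a minimal sketch in plain Java might look like this. It assumes that ambiguous tags are joined by a vertical bar (e.g. `JJ|VBN`); that separator is an assumption and should be checked against the actual files. `AmbiguousTagSketch` and `firstTag` are made-up names, not DKPro Core API.

```java
final class AmbiguousTagSketch {

    /**
     * Returns the first alternative of a possibly ambiguous tag.
     * Assumption: alternatives are joined by '|' (e.g. "JJ|VBN").
     */
    static String firstTag(String rawTag) {
        int bar = rawTag.indexOf('|');
        // No separator (or a leading one): keep the tag unchanged.
        return bar <= 0 ? rawTag : rawTag.substring(0, bar);
    }

    public static void main(String[] args) {
        System.out.println(firstTag("JJ|VBN"));
        System.out.println(firstTag("NN"));
    }
}
```

The remaining alternatives are simply dropped here; if they matter later, they would have to be stored somewhere else.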

Original comment by richard.eckart on 1 Aug 2014 at 11:45

GoogleCodeExporter commented 9 years ago

> @noun phrases: I'd suggest using "Chunk" annotations

why?
chunks and noun phrases are not the same;
for some users (e.g. me) this might be confusing

Original comment by eckle.kohler on 1 Aug 2014 at 11:56

GoogleCodeExporter commented 9 years ago

[deleted comment]

GoogleCodeExporter commented 9 years ago

I don't know the name of the format, sorry. I don't think it has a name; it's a 
forward-slash-separated token/tag plain-text annotation. The NP marking in 
brackets was, I assume, the reason why they added so many line breaks: to make 
the text file easier to process.

Here is a new case: they also annotated whether a word was misspelled and, if 
so, added the tag it should have had if it were written correctly, as in this 
example:
the/DT students/^NNS^POS parents/NNS

The missing ' caused the NNS tag, but it should have been students' and thus 
POS as tag.
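
A rough sketch of how such a token could be split, in plain Java. It assumes that the marker is always of the form `^written^intended`, which is only inferred from the single example above. The split is done on the last slash because the token text might itself contain a slash. `TaggedTokenSketch` is a made-up helper, not a DKPro Core class.

```java
final class TaggedTokenSketch {
    final String text;
    final String tag;          // tag for the form as written
    final String intendedTag;  // tag for the corrected form, or null if unmarked

    private TaggedTokenSketch(String text, String tag, String intendedTag) {
        this.text = text;
        this.tag = tag;
        this.intendedTag = intendedTag;
    }

    /** Parses one "text/TAG" item; the "^A^B" marker is an assumed convention. */
    static TaggedTokenSketch parse(String raw) {
        int slash = raw.lastIndexOf('/');
        if (slash <= 0 || slash == raw.length() - 1) {
            throw new IllegalArgumentException("Not a token/tag pair: " + raw);
        }
        String text = raw.substring(0, slash);
        String tagPart = raw.substring(slash + 1);
        if (!tagPart.startsWith("^")) {
            return new TaggedTokenSketch(text, tagPart, null);
        }
        // "^NNS^POS" -> ["NNS", "POS"]
        String[] parts = tagPart.substring(1).split("\\^");
        String intended = parts.length > 1 ? parts[1] : null;
        return new TaggedTokenSketch(text, parts[0], intended);
    }

    public static void main(String[] args) {
        TaggedTokenSketch t = parse("students/^NNS^POS");
        System.out.println(t.text + " " + t.tag + " " + t.intendedTag);
    }
}
```

Which of the two tags ends up in the POS annotation is a separate decision; the sketch only keeps both available.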

Original comment by Tobias.H...@gmail.com on 1 Aug 2014 at 12:12

GoogleCodeExporter commented 9 years ago

>> @noun phrases: I'd suggest using "Chunk" annotations

> why? chunks and noun phrases are not the same; for some users (e.g. me) this 
might be confusing

Constituents are modeled in a hierarchy in DKPro Core (they have 
parent/children references). 
Chunks are modeled flat in DKPro Core (they have no such references).
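
To illustrate the difference only, here is a small plain-Java sketch. `FlatChunk` and `TreeNode` are made-up stand-ins and not the DKPro Core types; the labels and offsets refer to the text "Local industry 's investment in Rhode Island".

```java
import java.util.ArrayList;
import java.util.List;

final class SpanModelSketch {

    /** Flat span: offsets and a label, no links to other spans. */
    static final class FlatChunk {
        final int begin;
        final int end;
        final String label;

        FlatChunk(int begin, int end, String label) {
            this.begin = begin;
            this.end = end;
            this.label = label;
        }
    }

    /** Hierarchical span: additionally knows its parent and its children. */
    static final class TreeNode {
        final int begin;
        final int end;
        final String label;
        TreeNode parent;
        final List<TreeNode> children = new ArrayList<>();

        TreeNode(int begin, int end, String label) {
            this.begin = begin;
            this.end = end;
            this.label = label;
        }

        TreeNode addChild(TreeNode child) {
            child.parent = this;
            children.add(child);
            return this;
        }
    }

    /** Flat view: two independent spans. */
    static List<FlatChunk> exampleChunks() {
        List<FlatChunk> chunks = new ArrayList<>();
        chunks.add(new FlatChunk(0, 28, "NP"));   // Local industry 's investment
        chunks.add(new FlatChunk(32, 44, "NP"));  // Rhode Island
        return chunks;
    }

    /** Hierarchical view: one NP that contains an NP and a PP. */
    static TreeNode exampleTree() {
        TreeNode pp = new TreeNode(29, 44, "PP").addChild(new TreeNode(32, 44, "NP"));
        return new TreeNode(0, 44, "NP")
                .addChild(new TreeNode(0, 28, "NP"))
                .addChild(pp);
    }

    public static void main(String[] args) {
        System.out.println(exampleChunks().size() + " flat chunks");
        System.out.println(exampleTree().children.size() + " children under the top NP");
    }
}
```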

Original comment by richard.eckart on 1 Aug 2014 at 12:18

GoogleCodeExporter commented 9 years ago

>>Constituents are modeled in a hierarchy in DKPro Core (they have a 
parent/children references). 
>>Chunks are modelled flat in DKPro Core (they have no such references). 

ok - but what is actually annotated in the PTB: chunks or noun phrases?

if the "noun phrase" annotation is mapped to DKPro chunks, then information 
about the hierarchical structure of noun phrases is lost

Original comment by eckle.kohler on 1 Aug 2014 at 12:31

GoogleCodeExporter commented 9 years ago

"Originally, each of the texts was run through PARTS (Ken Church's
stochastic part-of-speech tagger) or Eric Brill's tagger and then corrected
by a human annotator.  The square brackets surrounding phrases in the texts
are the output of a stochastic NP parser that is part of PARTS and are best
ignored."

This is what it looks like in the files:
==================================

[ Local/JJ industry/NN 's/POS investment/NN ]
in/IN 
[ Rhode/NNP Island/NNP ]
was/VBD 
[ the/DT big/JJ story/NN ]
in/IN 
[ 1960/CD 's/POS  industrial/JJ development/NN effort/NN ]
./. 
==================================
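
Reading this layout could be sketched roughly as follows, in plain Java without any UIMA types. The sketch assumes that `[` and `]` always appear as separate whitespace-delimited items and that every other item is `text/TAG`, which is only inferred from the excerpt above. `ChunkedFormatSketch` is a made-up name; a real reader would create Token, POS and Chunk annotations instead of filling lists.

```java
import java.util.ArrayList;
import java.util.List;

final class ChunkedFormatSketch {

    static final class Result {
        /** Each entry is {text, tag}. */
        final List<String[]> tokens = new ArrayList<>();
        /** Each entry is {firstTokenIndex, endTokenIndexExclusive}. */
        final List<int[]> chunks = new ArrayList<>();
    }

    static Result parse(String input) {
        Result result = new Result();
        int chunkStart = -1;
        for (String item : input.trim().split("\\s+")) {
            if (item.isEmpty() || item.startsWith("=====")) {
                continue; // blank input or separator line
            }
            if (item.equals("[")) {
                chunkStart = result.tokens.size();
            } else if (item.equals("]")) {
                if (chunkStart >= 0) {
                    result.chunks.add(new int[] { chunkStart, result.tokens.size() });
                }
                chunkStart = -1;
            } else {
                // Split on the last slash, since the text may contain one itself.
                int slash = item.lastIndexOf('/');
                if (slash <= 0) {
                    continue; // not a token/tag pair, skipped in this sketch
                }
                result.tokens.add(new String[] {
                        item.substring(0, slash), item.substring(slash + 1) });
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Result r = parse("[ Rhode/NNP Island/NNP ]\nwas/VBD \n./. ");
        System.out.println(r.tokens.size() + " tokens, " + r.chunks.size() + " chunk(s)");
    }
}
```

Chunks are recorded as token index ranges here; converting them to character offsets would happen while the document text is being built.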

Original comment by Tobias.H...@gmail.com on 1 Aug 2014 at 12:34

GoogleCodeExporter commented 9 years ago

in the example you give, Tobias, the things in square brackets are only chunks 
- so Richard's suggestion (using Chunk) will be fine

an example of a noun phrase would be
[ Local/JJ industry/NN 's/POS investment/NN in/IN  Rhode/NNP Island/NNP ]

Original comment by eckle.kohler on 1 Aug 2014 at 12:39

GoogleCodeExporter commented 9 years ago

Ok, thx for the feedback.

Original comment by Tobias.H...@gmail.com on 1 Aug 2014 at 12:49

GoogleCodeExporter commented 9 years ago

Where should I place the new file? Project: 
de.tudarmstadt.ukp.dkpro.core.io.penntree-asl
In the same package as the parsing-related classes or open a new package?

Original comment by Tobias.H...@gmail.com on 1 Aug 2014 at 2:13

GoogleCodeExporter commented 9 years ago

Same package. Just don't call it PennTreebankReader ;) That would be the one 
for the bracketed structure. Yours should have a different name.

Original comment by richard.eckart on 1 Aug 2014 at 2:16

GoogleCodeExporter commented 9 years ago

Hm.... feel free to make suggestions, seems like my most favored name is not 
available :)

How about:

PTB[Chunked]TaggedCorpusReader

Original comment by Tobias.H...@gmail.com on 1 Aug 2014 at 2:21

GoogleCodeExporter commented 9 years ago

PennTreebankChunkedReader?

Original comment by richard.eckart on 1 Aug 2014 at 2:22

GoogleCodeExporter commented 9 years ago

I think I like your name better :)

Original comment by Tobias.H...@gmail.com on 1 Aug 2014 at 2:25

GoogleCodeExporter commented 9 years ago

ok, I committed.

Original comment by Tobias.H...@gmail.com on 1 Aug 2014 at 2:40

Changed state: Started

GoogleCodeExporter commented 9 years ago

Please merge all the test cases into one class called 
PennTreebankChunkedReaderTest.

For the parameters, please use the standard parameters from 
ComponentParameters, e.g.:

    /**
     * Location of the mapping file for part-of-speech tags to UIMA types.
     */
    public static final String PARAM_POS_MAPPING_LOCATION = ComponentParameters.PARAM_POS_MAPPING_LOCATION;
    @ConfigurationParameter(name = PARAM_POS_MAPPING_LOCATION, mandatory = false)
    protected String posMappingLocation;

For the tests, please use the DKPro Core AssertAnnotations methods, cf. 
OpenNlpParserTest and Conll2000ReaderTest.

No need to use PARAM_PATTERNS, you can merge that information into the 
PARAM_SOURCE_LOCATION unless you have multiple include/exclude patterns.

For unit tests just write "throws Exception" instead of listing each exception 
separately.

Original comment by richard.eckart on 1 Aug 2014 at 6:53

GoogleCodeExporter commented 9 years ago

This issue was updated by revision r2673.

- Formatting / cleaning up

Original comment by richard.eckart on 2 Aug 2014 at 7:30

GoogleCodeExporter commented 9 years ago

This issue was updated by revision r2674.

- Some formatting / cleaning up

Original comment by richard.eckart on 2 Aug 2014 at 7:36

GoogleCodeExporter commented 9 years ago

I updated the recent commit messages. Please check them out in the history to 
see how they should be written such that they also update the issue with the 
changes (see the two auto-generated comments above).

There are still various things to be fixed in the PennTreebankChunkedReader:

https://code.google.com/p/dkpro-core-asl/source/detail?r=2666

Original comment by richard.eckart on 2 Aug 2014 at 7:37

Added labels: DKPro-ASL, Module-io.penntree

GoogleCodeExporter commented 9 years ago

Ehm where/how do I see what has to be fixed?

Btw, Eclipse uses formatter .xml files that define how code is formatted when 
the Eclipse keyboard shortcut is used. You don't use the Eclipse default, do 
you? Where do I get the DKPro version of these files?

Original comment by Tobias.H...@gmail.com on 2 Aug 2014 at 7:47

GoogleCodeExporter commented 9 years ago

Follow the link to revision 2666 in the previous comment and check out all the 
Line-by-line comments. One of them includes a link to the Eclipse code style 
file as well.

I'm using Eclipse. I format using the keyboard shortcut, but often I format 
only selected parts of a file, not the whole file, because there are some lines 
I don't want auto-formatted, e.g. when I align parameter/value pairs in 
createEngineDescription(...) such that there is one pair per line.

Original comment by richard.eckart on 2 Aug 2014 at 9:02

GoogleCodeExporter commented 9 years ago

Hi Tobias,
direct link to the style xml here (from "Downloads"): 
https://code.google.com/p/dkpro-core-asl/downloads/detail?name=DKProCoreStyle_20120326.xml&can=2&q=

Original comment by eriklan.dodinh@gmail.com on 3 Aug 2014 at 8:23

GoogleCodeExporter commented 9 years ago

This issue was updated by revision r2675.

- Fixed value of PARAM_TAGSET in test case.

Original comment by richard.eckart on 3 Aug 2014 at 9:26

GoogleCodeExporter commented 9 years ago

ok, I saw you updated files. Is there anything left to do?

Original comment by Tobias.H...@gmail.com on 4 Aug 2014 at 6:58

GoogleCodeExporter commented 9 years ago

Yes - I didn't address many of the comments that I made.

Original comment by richard.eckart on 4 Aug 2014 at 7:37

GoogleCodeExporter commented 9 years ago

Maybe I'm looking in the wrong place, but I see nothing. When I look at the code in 
the browser, I notice that I can add comments, but I don't see any already 
attached comments. Where do I have to look?

Original comment by Tobias.H...@gmail.com on 4 Aug 2014 at 7:44

GoogleCodeExporter commented 9 years ago

If you follow this link:

https://code.google.com/p/dkpro-core-asl/source/detail?r=2666

and you scroll down, you should see a section *Line-by-line comments*.

Original comment by richard.eckart on 4 Aug 2014 at 7:46

GoogleCodeExporter commented 9 years ago

hm, no. I see the section Line-by-line comments, but it says that no comments 
have been added yet. Maybe it's a permission problem?

Original comment by Tobias.H...@gmail.com on 4 Aug 2014 at 7:55

GoogleCodeExporter commented 9 years ago

Stupid me... I haven't used the review tool often yet and forgot to actually 
publish the review ;) Now you should be able to see them.

Original comment by richard.eckart on 4 Aug 2014 at 8:04

GoogleCodeExporter commented 9 years ago

Ok, I can see them now.

Is there no pre-implemented file-loading code in the other super-class? Do I 
have to reimplement the entire file-loading code? What is the benefit of this 
class btw.? It only seems less convenient....

Original comment by Tobias.H...@gmail.com on 4 Aug 2014 at 9:46

GoogleCodeExporter commented 9 years ago

> is there no pre-implemented file-loading code in the other super-class? I do 
have to reimplement the entire file loading code? What is the benefit of this 
class btw. It seems only less convenient.... 

You mean in JCasResourceCollectionReader_ImplBase? It extends 
ResourceCollectionReaderBase (which has the loading code) but it makes sure 
that you get a JCas instead of a CAS in the getNext() method.
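
The general idea can be sketched in plain Java like this. All types below (`LowLevelCas`, `TypedCas`, `CasReaderBase`, `TypedReaderBase`) are made-up stand-ins to show the pattern; they are not the UIMA or DKPro Core classes, and the real implementation may differ in detail.

```java
final class ReaderAdapterSketch {

    /** Stand-in for the low-level CAS (hypothetical). */
    static class LowLevelCas {
        String text;
    }

    /** Stand-in for the typed wrapper around the CAS (hypothetical). */
    static class TypedCas {
        final LowLevelCas cas;

        TypedCas(LowLevelCas cas) {
            this.cas = cas;
        }

        void setDocumentText(String text) {
            cas.text = text;
        }
    }

    /** Stand-in for the base class that works on the low-level CAS. */
    abstract static class CasReaderBase {
        abstract void getNext(LowLevelCas cas);
    }

    /** Adapts the low-level call once, so subclasses only see the typed wrapper. */
    abstract static class TypedReaderBase extends CasReaderBase {
        @Override
        final void getNext(LowLevelCas cas) {
            getNext(new TypedCas(cas));
        }

        abstract void getNext(TypedCas typed);
    }

    /** A concrete reader only overrides the typed variant. */
    static CasReaderBase demoReader() {
        return new TypedReaderBase() {
            @Override
            void getNext(TypedCas typed) {
                typed.setDocumentText("the/DT");
            }
        };
    }

    public static void main(String[] args) {
        LowLevelCas cas = new LowLevelCas();
        demoReader().getNext(cas);
        System.out.println(cas.text);
    }
}
```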

Original comment by richard.eckart on 4 Aug 2014 at 9:48

GoogleCodeExporter commented 9 years ago

Of course your class needs to override getNext(JCas aJCas) now instead of 
getNext(CAS cas).

Original comment by richard.eckart on 4 Aug 2014 at 9:49

GoogleCodeExporter commented 9 years ago

Ah ok.

Does select(jcas, Token.class) also work if I inherit from 
JCasResourceCollectionReader_ImplBase? Seemingly not, the method call is 
unknown. What's wrong with JCasUtil?

Original comment by Tobias.H...@gmail.com on 4 Aug 2014 at 10:15

GoogleCodeExporter commented 9 years ago

Sure, why shouldn't it work?

Original comment by richard.eckart on 4 Aug 2014 at 10:22

GoogleCodeExporter commented 9 years ago

Do you mean JCasUtil.select or a call to a method select which should have been 
inherited? The latter doesn't work.

Original comment by Tobias.H...@gmail.com on 4 Aug 2014 at 10:27

GoogleCodeExporter commented 9 years ago

I mean calling JCasUtil.select. If you turn that into a static import, you can 
just call it as "select", e.g. 

    import static org.apache.uima.fit.util.JCasUtil.select;

    for (Sentence sentence : select(aJCas, Sentence.class)) { ... }

Original comment by richard.eckart on 4 Aug 2014 at 10:28

GoogleCodeExporter commented 9 years ago

Oh ok. 

How do I set the mapped UIMA-class if I use JCas instead of CAS?

Original comment by Tobias.H...@gmail.com on 4 Aug 2014 at 11:36

GoogleCodeExporter commented 9 years ago

For this aspect only, you get the CAS from the JCas and do it the traditional way. 
Check out e.g. BncReader.

Original comment by richard.eckart on 4 Aug 2014 at 12:08

GoogleCodeExporter commented 9 years ago

Hm, it's not working. It does not set the mapped UIMA value. Is there no method 
you can call that configures that automatically? JCas is a bit easier to use, 
but these exceptions nullify that benefit in an instant. 

What is wrong with this code? It worked with the ResourceCollectionReader super 
class, but under JCasResourceCollectionReader_ImplBase it does not set the 
mapped value either.

        CAS aCAS = aJCas.getCas();
        posMappingProvider.configure(aCAS);

        // Token
        Type tokenType = aCAS.getTypeSystem().getType(Token.class.getName());
        AnnotationFS tokenAnno = aCAS.createAnnotation(tokenType, aCurrPosInText, aTokenText.length()
                + aCurrPosInText);
        aCAS.addFsToIndexes(tokenAnno);

        Feature feature = tokenType.getFeatureByBaseName("pos");

        // Tag
        Type posType = posMappingProvider.getTagType(aTag);
        // aCAS.getTypeSystem().getT.getFeatureByBaseName("pos");
        AnnotationFS posAnno = aCAS.createAnnotation(posType, aCurrPosInText, aTokenText.length());
        posAnno.setStringValue(posType.getFeatureByBaseName("PosValue"), aTag);
        aCAS.addFsToIndexes(posAnno);

        // Set the POS for the Token
        tokenAnno.setFeatureValue(feature, posAnno);

Original comment by Tobias.H...@gmail.com on 4 Aug 2014 at 12:28

GoogleCodeExporter commented 9 years ago

This issue was updated by revision r2679.

- Basic conversion to JCasResourceCollectionReader_ImplBase

Original comment by richard.eckart on 4 Aug 2014 at 12:37

GoogleCodeExporter commented 9 years ago

I have performed the basic conversion to JCasResourceCollectionReader_ImplBase. 
Please check out the diffs.

Original comment by richard.eckart on 4 Aug 2014 at 12:38

GoogleCodeExporter commented 9 years ago

This issue was updated by revision r2680.

- Updated formatting

Original comment by Tobias.H...@gmail.com on 4 Aug 2014 at 12:43

GoogleCodeExporter commented 9 years ago

I still don't get what is wrong with the snippet I posted earlier, though... it 
seems to be pretty much the same?

Original comment by Tobias.H...@gmail.com on 4 Aug 2014 at 12:47

GoogleCodeExporter commented 9 years ago

Well, I'm not sure what exactly you say is not working and how you determine 
that it is not working.
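
One thing that may be worth checking, as a guess from reading the snippet rather than a confirmed diagnosis: the Token is created with `aTokenText.length() + aCurrPosInText` as its end, while the POS annotation is created with `aTokenText.length()` only. `createAnnotation(type, begin, end)` takes absolute offsets, so for any token that does not start at position 0 the POS end would fall short of the Token end. A minimal sketch of the offset arithmetic in plain Java (`OffsetSketch` and `span` are made-up names):

```java
final class OffsetSketch {

    /** Returns {begin, end} as absolute offsets for a token starting at begin. */
    static int[] span(int begin, String tokenText) {
        return new int[] { begin, begin + tokenText.length() };
    }

    public static void main(String[] args) {
        String text = "the students parents";
        int begin = text.indexOf("students");
        int[] s = span(begin, "students");
        // The covered text of the span should be the token itself.
        System.out.println(text.substring(s[0], s[1]));
    }
}
```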

Original comment by richard.eckart on 4 Aug 2014 at 1:09

GoogleCodeExporter commented 9 years ago

Never mind.

Original comment by Tobias.H...@gmail.com on 4 Aug 2014 at 2:41

Changed state: Fixed

GoogleCodeExporter commented 9 years ago

Why did you undo the changes that I did to the file?

Original comment by richard.eckart on 4 Aug 2014 at 2:53

GoogleCodeExporter commented 9 years ago

I copied your 'Set the pos correctly' code snippet into my local working copy 
and then copied my version over the DKPro one.
What was lost?

Original comment by Tobias.H...@gmail.com on 4 Aug 2014 at 2:57

GoogleCodeExporter commented 9 years ago

This issue was updated by revision r2681.

- Restoring my modifications

Original comment by richard.eckart on 4 Aug 2014 at 3:01

xiaoyangren / dkpro-core-asl

PennTreeBank Reader for tagged corpora #439