Sentence Annotation - Githubissues

maebert commented 8 years ago

We will likely need several different representations of sentences that might be FRDs for various features:

[ ] Stripped of all punctuation & accents, lower case, target term replaced with _TERM_
[ ] Tokenised and POS-Tagged (Penn Treebank)
[ ] Single feature marker for sentences, such as number of sub-clauses (or something that actually makes sense)

Since computing POS tags is rather CPU-intensive, I'd store the POS tags on the sentence attribute of each message, too.

The sentence annotation should take a sentence as a string, and the term as a string:

def annotate(sentence, term):
    ...

and return a dictionary with at least the following:

{
    "s": "A kalyptic culture is typified by peacefulness, tolerance and individualism.",
    "s_clean": "a _TERM_ culture is typified by peacefulness tolerance and individualism",
    "pos_tags": "A/DT _TERM_/JJ culture/NN is/VBZ typified/VBN by/IN peacefulness/NN ,/, tolerance/NN and/CC individualism/NN ./.",
    "features": {
      ...
    }
}
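
For illustration, a minimal sketch of what `annotate` could look like under the spec above. This is not the implementation that ended up in the repo. The helpers `clean` and `_normalise`, the regex tokeniser, and the optional `tagger` argument are all assumptions made for the example. The tagger is any callable that maps a list of tokens to `(token, tag)` pairs (for instance `nltk.pos_tag`), so the sketch itself has no dependencies. `features` is left empty because the issue does not specify its contents.

```python
import re
import unicodedata

PLACEHOLDER = "_TERM_"


def _normalise(text):
    """Strip accents and punctuation, and lower-case the text."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    unaccented = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Assumption: punctuation (and underscores) are deleted, not replaced by spaces.
    return re.sub(r"[^\w\s]|_", "", unaccented.lower())


def clean(sentence, term):
    """Return the cleaned sentence with the target term masked (hypothetical helper)."""
    # Split on the term first so the placeholder is not lower-cased or stripped.
    # Assumption: the term starts and ends with a word character.
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    parts = [_normalise(part) for part in pattern.split(sentence)]
    joined = (" " + PLACEHOLDER + " ").join(parts)
    return " ".join(joined.split())  # collapse runs of whitespace


def annotate(sentence, term, tagger=None):
    """Build the annotation dictionary described in the issue.

    `tagger` is an assumption of this sketch: a callable taking a list of
    tokens and returning (token, tag) pairs. Without it, "pos_tags" is None.
    """
    pos_tags = None
    if tagger is not None:
        # Naive tokeniser: runs of word characters, or single punctuation marks.
        tokens = re.findall(r"\w+|[^\w\s]", sentence)
        # Tag the original tokens, then mask the term in the output string.
        # Limitation: only single-token terms are masked here.
        pos_tags = " ".join(
            "{}/{}".format(PLACEHOLDER if tok.lower() == term.lower() else tok, tag)
            for tok, tag in tagger(tokens)
        )
    return {
        "s": sentence,
        "s_clean": clean(sentence, term),
        "pos_tags": pos_tags,
        "features": {},  # left empty: not specified in the issue
    }
```

Tagging the original tokens and masking the term afterwards keeps the tag for the real word (`JJ` in the example above) instead of asking the tagger to guess a tag for `_TERM_`.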

clarecorthell commented 8 years ago

Rm bold/italic. These will be captured with a list of tokens for each at the document level.

maebert commented 8 years ago

@clarecorthell Thanks, updated this and #12 accordingly

maebert commented 8 years ago

Fixed in #44

wordnik / serapis

Sentence Annotation #11