wrathematics / ngram

Fast n-Gram Tokenization

Question about tokenization of separate sentences in one string. #4

Closed JosephPotashnik closed 8 years ago

JosephPotashnik commented 8 years ago

Hi,

I love the ngram library! Thank you!

May I ask how to make the ngram tokenizer treat different sentences in the input string (separated by, say, ",", ";", ".", "-", "(", ")" and such) separately? That is, the last word of one sentence should not form a bigram with the first word of the next sentence.

Example: str = "John loves apples. John loves cakes" should give bigrams = { "John loves", "loves apples", "loves cakes" }

I am new to R, so there may be an obvious way in the documentation that I have missed. Thank you.

heckendorfc commented 8 years ago

Basically you'll want to split the string into sentences first. Then pass the character vector containing the two separate sentences to ngram, asking it to split by spaces (the default). preprocess() might also be useful depending on how messy your input is.

> x <- "John loves apples. John loves cakes"
> splitx <- unlist(strsplit(x, ". ", fixed = TRUE))
> splitx
[1] "John loves apples" "John loves cakes" 
> ng <- ngram(splitx,2)
> get.phrasetable(ng)
         ngrams freq prop
1   John loves     2 0.50
2 loves apples     1 0.25
3  loves cakes     1 0.25
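
For the other separators mentioned in the question (",", ";", "-", "(", ")"), the same split-first idea can be extended with a regular expression. This is only a minimal sketch: `split_sentences` is a hypothetical helper, not part of the ngram package, and the filter on short chunks assumes that ngram may reject strings with fewer than n words. Note that splitting on "-" also breaks hyphenated words apart.

```r
# Hypothetical helper (not part of ngram): split a string into
# sentence-like chunks on any of the listed delimiter characters.
split_sentences <- function(x, delims = "[,;.()-]") {
  parts <- unlist(strsplit(x, delims))  # regex split on the character class
  parts <- trimws(parts)                # strip surrounding whitespace
  parts[nzchar(parts)]                  # drop empty chunks
}

x <- "John loves apples. John loves cakes"
splitx <- split_sentences(x)

# Run the ngram step only if the package is installed.
if (requireNamespace("ngram", quietly = TRUE)) {
  # Assumption: chunks with fewer than 2 words cannot yield a bigram,
  # so they are filtered out before calling ngram().
  enough <- lengths(strsplit(splitx, "\\s+")) >= 2
  ng <- ngram::ngram(splitx[enough], n = 2)
  print(ngram::get.phrasetable(ng))
}
```

Here `splitx` holds the two sentences as separate strings, so no bigram spans the sentence boundary.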

Does that make sense?

JosephPotashnik commented 8 years ago

Perfect. Thank you kindly!