wrathematics / ngram

Fast n-Gram Tokenization

Error in ngram(text, n = 4) : input 'str' has nwords=3 and n=4; must have nwords >= n #10

Open arthur0421 opened 1 year ago

arthur0421 commented 1 year ago

text <- scan("ca10.txt", what = "char", sep = "\n") # ca10.txt is a file in the Brown corpus
text <- tolower(text)
text <- gsub("[^a-z- ]", "", text, perl = T)
quad <- get.phrasetable(ngram(text, n = 4))

This last line throws the error in the title. I don't understand why it says nwords=3, which is obviously untrue for the file as a whole. I guess it's because one line in the file contains only three tokens? How can I work around this issue? (BTW, I work with R 3.6.3 on Linux Mint 19.3.) Attachment: ca10.txt

heckendorfc commented 1 year ago

I think you're right. To bypass, you could pass text[-50] to exclude that 3-word line from your input.
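
If more than one line might be too short, or the offending index is not known in advance, a more general variant of the same workaround is to drop every line with fewer than n words before calling ngram(). The following is only a minimal sketch in base R: the sample vector stands in for the contents of ca10.txt, and count_words is a hypothetical helper, not part of the ngram package.

```r
# Hypothetical stand-in for the cleaned lines read from ca10.txt.
text <- c(
  "the jury said it did find",
  "only three words",
  "",
  "many georgia laws are outmoded or inadequate"
)
n <- 4

# Hypothetical helper: count whitespace-separated tokens on each line.
count_words <- function(x) {
  lengths(strsplit(trimws(x), "[[:space:]]+"))
}

# Keep only the lines long enough to yield at least one n-gram.
keep <- count_words(text) >= n
filtered <- text[keep]

# Report which lines were dropped, so the exclusion is not silent.
print(which(!keep))

# The original call can then be retried on the filtered input
# (assumes the ngram package is installed and loaded):
# quad <- get.phrasetable(ngram(filtered, n = 4))
```

Note that this discards the short lines entirely, so any words on them are not counted in the resulting phrase table.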