wrathematics / ngram

Fast n-Gram Tokenization

Tweak to multiread to ensure files are returned when prune.empty = TRUE #8

Closed russey closed 4 years ago

russey commented 4 years ago

Firstly, thank you for the excellent package. I just stumbled upon this while using multiread().

When prune.empty = TRUE, any empty files are dropped before the files are returned. I believe this works as intended when an empty file exists. However, when no empty files exist, an empty list is returned, dropping all of the non-empty files.

An example using the ngram R directory:

> multiread(path = "R", extension = ".R")
named list()

Line 74 looks to be the cause: text = text[-which(text == "")].

Works as expected when an empty file (string) is present:

> text <- c("hello", "hi", "")
> text[-which(text == "")]
[1] "hello" "hi"

Does not work as expected when no empty file (string) is present:

> text <- c("hello", "hi")
> text[-which(text == "")]
character(0)
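
The empty result appears to follow from how which() and negative indexing interact: with no matches, which() returns a zero-length integer vector, negating it leaves it zero-length, and indexing by a zero-length vector selects nothing. A minimal sketch of those steps (the names idx, neg_idx and dropped are illustrative only and do not appear in the package):

```r
text <- c("hello", "hi")

# No element equals "", so which() returns a zero-length integer vector.
idx <- which(text == "")

# Negating a zero-length vector leaves it zero-length, so there is
# nothing to exclude ...
neg_idx <- -idx

# ... and indexing by a zero-length vector selects no elements at all,
# rather than "everything except nothing".
dropped <- text[neg_idx]

length(idx)
identical(neg_idx, integer(0))
length(dropped)
```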

After the fix, empty files are still pruned:

> text <- c("hello", "hi", "")
> text[text != ""]
[1] "hello" "hi"

And non-empty files are still returned when there are no empty files:

> text <- c("hello", "hi")
> text[text != ""]
[1] "hello" "hi"
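
As a rough sketch of the same idea wrapped in a function: prune_empty_strings below is a hypothetical helper, not something in ngram, and the file names are made up for illustration. It assumes the contents are held in a named character vector, which may not match how multiread() stores them internally.

```r
# Hypothetical helper (not part of ngram): keep only non-empty strings
# by using a logical mask instead of negative integer indices.
prune_empty_strings <- function(text) {
  text[text != ""]
}

# Made-up file names and contents, for illustration only.
files <- c(a.R = "hello", b.R = "hi")

# No empty entries: everything is kept, names included.
kept <- prune_empty_strings(files)

# One empty entry: only that entry is dropped.
with_empty <- prune_empty_strings(c(files, c.R = ""))

print(kept)
print(with_empty)
```

A logical mask always has the same length as the vector it indexes, so the case where nothing matches needs no special handling.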

Many thanks, Joe

wrathematics commented 4 years ago

Thanks!