weecology / MATSS-LDATS

Macroecological LDA analysis of time series

Preprocessing to remove low-information words #37

Open diazrenata opened 5 years ago

diazrenata commented 5 years ago

From http://www.cs.columbia.edu/~blei/papers/BleiLafferty2009.pdf (Blei & Lafferty 2009):

Choosing the vocabulary. It is often computationally expensive to use the entire vocabulary. Choosing the top V words by TFIDF is an effective way to prune the vocabulary. This naturally prunes out stop words and other terms that provide little thematic content to the documents. In the Science analysis above we chose the top 10,000 terms this way.

"TFIDF" = term frequency/inverse document frequency.

Could implement this using bind_tf_idf from tidytext: https://www.tidytextmining.com/tfidf.html
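
A minimal sketch of what that could look like, assuming a long-format abundance table with hypothetical columns `period` (the "document", e.g. a census period), `species` (the "term"), and `abundance` (the count). Ranking species by their maximum TF-IDF score is one reasonable way to pick the "top V words"; Blei & Lafferty don't specify the aggregation, so that choice is an assumption here.

```r
library(dplyr)
library(tidytext)

# Keep only the n_keep species with the highest TF-IDF score across periods.
prune_by_tfidf <- function(abundance_long, n_keep = 100) {
  ranked <- abundance_long %>%
    tidytext::bind_tf_idf(species, period, abundance) %>%
    group_by(species) %>%
    summarize(max_tf_idf = max(tf_idf)) %>%
    arrange(desc(max_tf_idf))

  top_species <- head(ranked$species, n_keep)

  filter(abundance_long, species %in% top_species)
}
```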

From https://www.pnas.org/content/101/suppl_1/5228 (Griffiths & Steyvers 2004):

Any delimiting character, including hyphens, was used to separate words, and we deleted any words that occurred in less than five abstracts or belonged to a standard “stop” list used in computational linguistics, including numbers, individual characters, and some function words.

They filtered words occurring in <5 documents.
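
A sketch of the analogous filter for our data, again assuming the hypothetical long-format table (`period`, `species`, `abundance`): drop species recorded in fewer than five sampling periods.

```r
library(dplyr)

# Keep species that are present (abundance > 0) in at least min_periods periods.
drop_rare_species <- function(abundance_long, min_periods = 5) {
  keep <- abundance_long %>%
    filter(abundance > 0) %>%
    distinct(species, period) %>%
    count(species) %>%
    filter(n >= min_periods) %>%
    pull(species)

  filter(abundance_long, species %in% keep)
}
```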

Transients:

From: https://esajournals.onlinelibrary.wiley.com/doi/epdf/10.1002/ecy.2398 (Snell et al 2018)

Following Coyle et al. (2013), we operationally defined a species as transient at a site if it was observed in 33% or fewer of the temporal sampling intervals, and assessed the prevalence of transients as the proportion of species in the assemblage below this threshold (Fig. 1A). We also evaluated more restrictive definitions using maximum temporal occupancy thresholds of 10% and 25% to evaluate the impact of this decision. Results were qualitatively similar for the three different thresholds (Appendix S3: Figs S3–S6).

Try 30%?
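
A sketch of a transient filter in the spirit of Coyle et al. (2013) / Snell et al. (2018), with a 30% default threshold. Column names (`period`, `species`, `abundance`) are the same assumed ones as above; treating "temporal sampling intervals" as our sampling periods is also an assumption.

```r
library(dplyr)

# Drop species whose temporal occupancy (proportion of periods with
# abundance > 0) is at or below the threshold.
drop_transients <- function(abundance_long, threshold = 0.30) {
  n_periods <- n_distinct(abundance_long$period)

  occupancy <- abundance_long %>%
    filter(abundance > 0) %>%
    distinct(species, period) %>%
    count(species, name = "n_occupied") %>%
    mutate(occupancy = n_occupied / n_periods)

  keep <- occupancy$species[occupancy$occupancy > threshold]

  filter(abundance_long, species %in% keep)
}
```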

diazrenata commented 5 years ago

803b0f4