From http://www.cs.columbia.edu/~blei/papers/BleiLafferty2009.pdf (Blei & Lafferty 2009):

"Choosing the vocabulary. It is often computationally expensive to use the entire vocabulary. Choosing the top V words by TFIDF is an effective way to prune the vocabulary. This naturally prunes out stop words and other terms that provide little thematic content to the documents. In the Science analysis above we chose the top 10,000 terms this way."

"TFIDF" = term frequency/inverse document frequency.

Could implement this using bind_tf_idf from tidytext: https://www.tidytextmining.com/tfidf.html

From https://www.pnas.org/content/101/suppl_1/5228 (Griffiths & Steyvers 2004):

"Any delimiting character, including hyphens, was used to separate words, and we deleted any words that occurred in less than five abstracts or belonged to a standard “stop” list used in computational linguistics, including numbers, individual characters, and some function words."

They filtered words occurring in <5 documents.

Transients:

From https://esajournals.onlinelibrary.wiley.com/doi/epdf/10.1002/ecy.2398 (Snell et al. 2018):

"Following Coyle et al. (2013), we operationally defined a species as transient at a site if it was observed in 33% or fewer of the temporal sampling intervals, and assessed the prevalence of transients as the proportion of species in the assemblage below this threshold (Fig. 1A). We also evaluated more restrictive definitions using maximum temporal occupancy thresholds of 10% and 25% to evaluate the impact of this decision. Results were qualitatively similar for the three different thresholds (Appendix S3: Figs S3–S6)."

Try 30%?
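The transient-species rule from Snell et al. can be sketched in a few lines of Python. Everything here is illustrative: the species names and counts are made up, and `temporal_occupancy` / `proportion_transient` are hypothetical helper names, not anything from the paper's code.

```python
# Sketch of the transient-species definition: a species is "transient" if its
# temporal occupancy (the fraction of sampling intervals in which it was
# observed) is at or below a threshold (33% in Snell et al., per Coyle et al.).

def temporal_occupancy(detections):
    """Fraction of sampling intervals with at least one detection."""
    return sum(1 for d in detections if d > 0) / len(detections)

def proportion_transient(assemblage, threshold=1/3):
    """Proportion of species at or below the occupancy threshold."""
    occupancies = [temporal_occupancy(d) for d in assemblage.values()]
    return sum(1 for o in occupancies if o <= threshold) / len(occupancies)

# Hypothetical assemblage: counts per species across 10 sampling intervals.
assemblage = {
    "species_a": [3, 2, 4, 1, 2, 3, 5, 2, 1, 4],  # occupancy 1.0
    "species_b": [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],  # occupancy 0.1
    "species_c": [1, 0, 2, 0, 0, 1, 0, 0, 0, 0],  # occupancy 0.3
    "species_d": [0, 1, 1, 0, 1, 1, 0, 1, 1, 1],  # occupancy 0.7
}

# Compare the paper's 10%, 25%, and 33% thresholds, plus the 30% idea above.
for t in (0.10, 0.25, 0.30, 1/3):
    print(f"threshold {t:.2f}: {proportion_transient(assemblage, t):.2f} transient")
```

With these toy numbers, dropping from 33% to 10% moves species_c out of the transient class, which is the kind of sensitivity the alternative thresholds are meant to probe.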
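The note above suggests tidytext's bind_tf_idf in R; as a library-free alternative, the top-V pruning idea can be sketched in Python. The toy corpus is invented, and aggregating per-document scores by taking each term's maximum TF-IDF is my assumption — Blei & Lafferty don't specify how scores are combined across documents.

```python
# Sketch of pruning a vocabulary to the top-V terms by TF-IDF.
# tf-idf(w, d) = (count of w in d / length of d) * log(N / doc-freq of w).
import math
from collections import Counter

def top_v_vocabulary(docs, v):
    """Return the v terms with the highest max TF-IDF over a toy corpus."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(w for toks in tokenized for w in set(toks))
    best = {}
    for toks in tokenized:
        tf = Counter(toks)
        for w, f in tf.items():
            score = (f / len(toks)) * math.log(n / df[w])
            best[w] = max(best.get(w, 0.0), score)
    return sorted(best, key=best.get, reverse=True)[:v]

docs = [
    "the cell culture showed rapid growth",
    "the galaxy survey mapped dark matter",
    "the protein sample degraded in culture",
]
print(top_v_vocabulary(docs, 5))
# "the" appears in every document, so its IDF term log(N/N) is zero and it
# sinks to the bottom of the ranking -- stop words prune themselves, which is
# the behavior Blei & Lafferty describe.
```

For the real analysis, v would be on the order of the 10,000 terms used in the Science example, not 5.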