xiaohuiyan / BTM

Code for Biterm Topic Model (published in WWW 2013)
https://github.com/xiaohuiyan/xiaohuiyan.github.io/blob/master/paper/BTM-WWW13.pdf
Apache License 2.0
405 stars, 137 forks

Treating empty documents #5

Closed: usptact closed this issue 8 years ago

usptact commented 8 years ago

Hi,

If some of the documents are empty (an empty line in the input file), the corresponding rows in the pz_d output file are all -nan. Of course, this is an edge case that can easily be dealt with by removing such "documents".

Empty documents can arise when a short document consists only of stopwords: after stopword removal, nothing is left.

Thanks for writing the code and sharing it.

xiaohuiyan commented 8 years ago

Empty documents are supposed to be filtered out during pre-processing. Alternatively, simply set P(z|d) = P(z) for them.
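
For readers who want to apply either suggestion, here is a minimal Python sketch of both options. It is not part of the BTM code base. The helper names (`drop_empty_docs`, `parse_row`, `fill_empty_docs`) are hypothetical, and the sketch assumes that each row of the pz_d output is a whitespace-separated list of topic probabilities, one row per input document, and that P(z) is available as a plain list of floats.

```python
import math


def drop_empty_docs(lines):
    """Option 1: remove blank documents before training.

    Returns the kept lines and their original indices, so that the
    rows of the output can be mapped back to the original documents.
    """
    kept = [(i, ln) for i, ln in enumerate(lines) if ln.strip()]
    return [ln for _, ln in kept], [i for i, _ in kept]


def parse_row(line):
    """Parse one row of probabilities; 'nan' and '-nan' become float NaN."""
    return [float(tok) for tok in line.split()]


def fill_empty_docs(pz_d_rows, pz):
    """Option 2: replace each all-NaN (or empty) row of P(z|d) with P(z)."""
    fixed = []
    for row in pz_d_rows:
        if not row or any(math.isnan(p) for p in row):
            fixed.append(list(pz))  # fall back to the global topic distribution
        else:
            fixed.append(row)
    return fixed
```

With option 1, keep the returned indices: the output will contain fewer rows than the original input, and the indices are what tie each row back to its document. Option 2 keeps the row count unchanged, which may be more convenient when downstream code expects one row per input line.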