How does BTM create its biterm sets?

xiaohuiyan / BTM

Code for Biterm Topic Model (published in WWW 2013)

https://github.com/xiaohuiyan/xiaohuiyan.github.io/blob/master/paper/BTM-WWW13.pdf

Apache License 2.0

How does BTM create its biterm sets? #10

Open LjessonS opened 7 years ago

LjessonS commented 7 years ago

Recently I've become interested in your idea of modeling word pairs within a document for short texts, but I'm a bit confused about how you count the biterm sets in BTM. You did a nice job implementing it in C++, but I'm not good at C++ and find the code hard to read. I wonder whether the count of every word pair within a document is one, and whether the biterm vector for the whole biterm set is updated by computing the word pairs document by document. I hope you can clear this up for me. Thank you very much!

xiaohuiyan commented 7 years ago

Not exactly right. A biterm is defined as a pair of words co-occurring in the same text window. For example, take the doc "A B C B" and suppose the window size is 3. There are then two text windows, which generate biterms as follows:

text window "A B C" => "A B", "B C", "A C"
text window "B C B" => "B C", "C B", "B B"

Since a biterm is an unordered word pair, "B C" = "C B". Thus, the doc counts the biterm "B C" 3 times and the biterms "A B", "A C", and "B B" once each.
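
For readers who find the C++ hard to follow, here is a minimal Python sketch of the counting exactly as described in this comment: slide a window over the doc, take every unordered pair inside each window, and tally them. It is not taken from the repository's C++ code, which may enumerate pairs differently. The function name `count_biterms` and the handling of docs shorter than the window (treated as a single window) are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def count_biterms(tokens, window_size):
    """Count unordered word pairs over every sliding text window.

    Hypothetical helper that follows the description in the comment above.
    """
    counts = Counter()
    # Assumption: a doc shorter than the window is treated as one window.
    n_windows = max(len(tokens) - window_size + 1, 1)
    for start in range(n_windows):
        window = tokens[start:start + window_size]
        # Every pair of positions in the window yields one biterm.
        for w1, w2 in combinations(window, 2):
            # Sort the pair so that "B C" and "C B" share the same key.
            counts[tuple(sorted((w1, w2)))] += 1
    return counts


counts = count_biterms("A B C B".split(), window_size=3)
print(counts[("B", "C")])  # 3
print(counts[("A", "B")], counts[("A", "C")], counts[("B", "B")])  # 1 1 1
```

Sorting each pair before counting is what makes the biterm unordered, and repeated pairs from overlapping windows are what push "B C" to 3.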

PS: Thanks to other contributors, you can find implementations of BTM in other languages (e.g., Python, Julia, Scala) on GitHub :)

himanshi-sinha commented 7 years ago

Hi, could you please provide a link to the Python implementation of BTM?

rtrad89 commented 4 years ago

Hi, could you please provide a link to the Python implementation of BTM?

Here.