xiaohuiyan / BTM

Code for Biterm Topic Model (published in WWW 2013)
https://github.com/xiaohuiyan/xiaohuiyan.github.io/blob/master/paper/BTM-WWW13.pdf
Apache License 2.0

What is `W` Precisely? #27

Open rtrad89 opened 4 years ago

rtrad89 commented 4 years ago

During topic learning, one needs to supply `W: int`, the size of the vocabulary.

I tried to work out the meaning of W from Algorithm 1 (the Gibbs sampling algorithm for BTM) in the paper *BTM: Topic Modeling over Short Texts*, but W is not listed as an input there. Since it seems data-dependent, am I correct in assuming that W means the number of unique terms in the cleaned and preprocessed corpus? If so, is there a reason W is not calculated automatically from the corpus docs_pt? I'm afraid I am missing something, hence my question.

Thank you.

zhongpeixiang commented 4 years ago

> During topic learning, one needs to supply `W: int`, the size of the vocabulary. I tried to work out the meaning of W from Algorithm 1 (the Gibbs sampling algorithm for BTM) in the paper *BTM: Topic Modeling over Short Texts*, but W is not listed as an input there. Since it seems data-dependent, am I correct in assuming that W means the number of unique terms in the cleaned and preprocessed corpus? If so, is there a reason W is not calculated automatically from the corpus docs_pt? I'm afraid I am missing something, hence my question.

W denotes the vocab size:

```shell
W=`wc -l < $voca_pt` # vocabulary size
```
rtrad89 commented 4 years ago

> W denotes the vocab size.
>
> ```shell
> W=`wc -l < $voca_pt` # vocabulary size
> ```

Can you clarify this part then, please?

> If so, is there a reason W is not calculated automatically from the corpus docs_pt? I'm afraid I am missing something, hence my question.

zhongpeixiang commented 4 years ago

$voca_pt is the vocabulary file, which is generated automatically from $doc_pt. See:

```shell
python indexDocs.py $doc_pt $dwid_pt $voca_pt
```
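So the pipeline already derives the vocabulary from the corpus; W just has to be read back from the resulting file. As a rough illustration of what such an indexing step does (a hedged sketch of the assumed behavior, not the actual `indexDocs.py` implementation), it maps each unique word to an integer id, rewrites documents as id sequences, and the vocabulary size W falls out of the mapping:

```python
def index_docs(doc_lines):
    """Map each unique word to an integer id and index the documents.

    Returns (indexed_docs, word2id); len(word2id) is the vocabulary size W.
    """
    word2id = {}
    indexed = []
    for line in doc_lines:
        ids = []
        for w in line.split():
            if w not in word2id:
                word2id[w] = len(word2id)  # assign ids in order of first appearance
            ids.append(word2id[w])
        indexed.append(ids)
    return indexed, word2id


docs = ["apple banana apple", "banana cherry"]
indexed, vocab = index_docs(docs)
W = len(vocab)  # here W == 3: apple, banana, cherry
```

Writing `vocab` out one entry per line would reproduce a `$voca_pt`-style file whose line count equals W.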