yongzhuo / nlg-yongzhuo

A Chinese natural language generation (NLG) toolkit for text summarization, with corpus data and extractive summarization methods: Lead3, keyword, TextRank, TextTeaser, word significance, LDA, LSI, NMF. (graph, feature, topic model, summarization tool or toolkit)
https://blog.csdn.net/rensihui
MIT License

Question about sklearn's lda result #16

Closed FengMu1995 closed 7 months ago

FengMu1995 commented 7 months ago

In topic_lda.py there is:

    res_lda_v = lda.components_

As I understand it, lda.components_ represents the topic-word distribution. However, later in the code:

    else:
        # Option 2: take the sentence with the highest topic probability, without separating by topic
        res_combine = {}
        for i in range(len_sentences_cut):
            res_row_i = res_lda_v[:, i]
            res_row_i_argmax = np.argmax(res_row_i)
            res_combine[self.sentences[i]] = res_row_i[res_row_i_argmax]

So what res_lda_v[:, i] pulls out should be a word's probabilities, yet what is finally computed is treated as a sentence's probability. This is where I'm confused: how are the sentences extracted? Were sentences already the units fed to LDA when it was fitted?
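For context, here is a minimal sketch on toy data (a hypothetical example, not the repository's code) of the standard sklearn convention the question starts from: with the usual document-term input, lda.components_ is indexed by (topic, feature), so its columns normally correspond to ngrams rather than sentences.

    # Minimal sketch on toy data (not topic_lda.py): standard sklearn usage,
    # where components_ is the topic-word distribution.
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["the cat sat on the mat", "the dog barked loudly", "cats and dogs play"]
    tf = CountVectorizer().fit_transform(docs)       # (n_docs, n_ngrams)

    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    doc_topic = lda.fit_transform(tf)                # (n_docs, n_topics)
    topic_word = lda.components_                     # (n_topics, n_ngrams)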

yongzhuo commented 7 months ago

It's not words; what is computed is per-sentence. The input to LDA is tf_ngram.T:

self.sentences_cut = [" ".join(sc) for sc in self.sentences_cut]
# compute the term frequency (tf) of each sentence
vector_c = CountVectorizer(ngram_range=(1, 2), stop_words=self.stop_words)
tf_ngram = vector_c.fit_transform(self.sentences_cut)
...
res_lda_u = lda.fit_transform(tf_ngram.T)
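To make the shapes concrete, here is a hedged walkthrough on toy data (hypothetical sentences and n_components, not the repository's real configuration). Because the input is tf_ngram.T, LDA treats the ngrams as the "documents" it is fitted on and the sentences as the features, so lda.components_ (res_lda_v) is indexed by (topic, sentence) and res_lda_v[:, i] is a per-topic score vector for sentence i.

    # Shape walkthrough on toy data (assumed example, not topic_lda.py).
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer

    sentences_cut = ["今天 天气 很好", "今天 心情 不错", "明天 继续 加油"]
    vector_c = CountVectorizer(ngram_range=(1, 2))
    tf_ngram = vector_c.fit_transform(sentences_cut)   # (n_sentences, n_ngrams)

    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    res_lda_u = lda.fit_transform(tf_ngram.T)           # (n_ngrams, n_topics)
    res_lda_v = lda.components_                         # (n_topics, n_sentences)

    print(tf_ngram.shape, res_lda_u.shape, res_lda_v.shape)
    # -> (3, n_ngrams), (n_ngrams, 2), (2, 3)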
FengMu1995 commented 7 months ago

It's not words; what is computed is per-sentence. The input to LDA is tf_ngram.T:

self.sentences_cut = [" ".join(sc) for sc in self.sentences_cut]
# compute the term frequency (tf) of each sentence
vector_c = CountVectorizer(ngram_range=(1, 2), stop_words=self.stop_words)
tf_ngram = vector_c.fit_transform(self.sentences_cut)
...
res_lda_u = lda.fit_transform(tf_ngram.T)

I'd like to ask: between res_lda_u and res_lda_v, which is the document-topic distribution and which is the topic-word distribution?

yongzhuo commented 7 months ago

res_lda_v is the document-topic distribution.

FengMu1995 commented 7 months ago

That makes sense then. So your approach is to treat each sentence as a "document" in LDA and extract sentences on that basis, right?
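Putting the thread together, the selection step being discussed ("Option 2" in topic_lda.py) amounts to scoring each sentence by its largest topic weight in res_lda_v and keeping the top scorers. A condensed, hedged sketch (variable names follow the thread; the helper below is illustrative, not code from the repository):

    # Hedged distillation of the "Option 2" selection discussed above.
    import numpy as np

    def pick_top_sentences(res_lda_v, sentences, num=3):
        """res_lda_v: (n_topics, n_sentences) array from lda.components_."""
        scores = res_lda_v.max(axis=0)             # best topic weight per sentence
        order = np.argsort(scores)[::-1][:num]     # indices of top-scoring sentences
        return [(sentences[i], float(scores[i])) for i in order]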