Closed: zhangatao closed this issue 4 years ago
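Within an impression, positives are usually far outnumbered by negatives (that is, most of the candidate news in an impression were not clicked).
Without negative sampling, this is an ordinary binary classification problem: every candidate news in every impression becomes one training sample, so the final dataset has far fewer positives than negatives. For ways to handle that, see https://towardsdatascience.com/methods-for-dealing-with-imbalanced-data-5b761be45a18
With negative sampling (which all the model code in this repo uses), one positive and K negatives are treated as a single group, and the loss expression reflects how well the whole group is matched. In this setting, "balance" means that, among the candidate news of an impression, each positive is matched with K negatives and the surplus negatives are discarded. You can pick one of the papers behind the models in this repo to read about this part. If I remember correctly, all of the papers except DKN's describe negative sampling.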
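The grouping described above can be sketched roughly as follows. This is a minimal illustration, not the repo's actual preprocessing code: the function names `balance_impression` and `group_loss`, the `(news_id, clicked)` input format, the `PADDED_NEWS` placeholder used when an impression has fewer than K negatives, and sampling negatives independently for each positive are all assumptions made for the example. The loss shown is the usual negative log-likelihood of the positive under a softmax over the K+1 scores, which is one common way to score the whole group.

```python
import math
import random


def balance_impression(candidates, k, rng=random):
    """Build one (positive, k negatives) group per clicked news.

    candidates: list of (news_id, clicked) pairs from one impression
                (hypothetical format chosen for this sketch).
    Negatives that are not sampled into any group are simply dropped.
    """
    positives = [news for news, clicked in candidates if clicked]
    negatives = [news for news, clicked in candidates if not clicked]
    groups = []
    for pos in positives:
        if len(negatives) >= k:
            # Draw k distinct negatives for this positive.
            sampled = rng.sample(negatives, k)
        else:
            # Assumption: pad with a placeholder when there are too few negatives.
            sampled = negatives + ["PADDED_NEWS"] * (k - len(negatives))
        groups.append((pos, sampled))
    return groups


def group_loss(scores):
    """Negative log-likelihood of the positive under a softmax.

    scores[0] is the click score of the positive; scores[1:] are the
    scores of its k negatives. Uses the log-sum-exp trick for stability.
    """
    m = max(scores)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum_exp - scores[0]
```

For example, an impression with 2 clicked and 10 unclicked news and K = 4 yields 2 groups of 5 news each; the unclicked news that are not sampled into either group do not appear in the training data.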
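Thanks for the reply. I'll take another look at the papers. Thank you 🙏
I'm planning to study your code and add pretraining on top of it; I see the embeddings still use GloVe. Also, this is a really nice project, so thanks again 🙏
I just looked at the news recommendation code in recommenders and found the negative sampling handling in its source code as well.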
@yusanshi can you please share pretrained weight for this model? One more thing: please let me know which config you used before training and evaluation. Thanks.
@ayush-angelium

> can you please share pretrained weight for this model

Sorry, but I don't actually have them. Months ago, I trained and tested all the methods on the MIND small dataset and showed the results and checkpoint links in README.md. However, I have since made some small changes to the code and started using the MIND large dataset, so I removed the outdated results. I haven't trained and tested on the MIND large dataset yet.

> let me know which config you used before training and evaluation

Just those in src/config.py.
Hi, in the data processing part, why is the data balanced? Is there a particular reason for it?