xgfs / node2vec-c

node2vec implementation in C++
MIT License
51 stars 9 forks source link

Is there any difference between node2vec-c and reference implementation of Node2Vec? #4

Open dkorkinof opened 2 years ago

dkorkinof commented 2 years ago

I was wondering if there is (algorithmically) any difference between your C++ implementation of Node2Vec and the original reference implementation from Stanford. For instance, I saw something about subsampling frequent nodes in the C++ code. I'm asking because I tried a few other Node2Vec implementations, including the one in pytorch_geometric, and have had trouble replicating the good performance of your C++ code. So I was wondering if there is anything missing in those.

xgfs commented 2 years ago

Can you link the original implementation you are mentioning explicitly? I have reimplemented the official Python code that used the Gensim library underneath. The subsampling actually comes from word2vec implementation in Gensim (for that matter, it's standard word2vec stuff). I have provided some subsampling analysis in Section 3.6 of the VERSE paper. I believe the subsampling is rather key, as it effectively shifts the positive pair distribution.

dkorkinof commented 2 years ago

ok, that helps a lot! I didn't look into the Gensim implementation at all. The two implementations I tried were this, based on Gensim, so it should be fine and the pytorch_geometric here which I'm pretty sure doesn't do any subsampling (unless that is done while sampling the walks).

xgfs commented 2 years ago

Yeah, I tried to keep it close to Python implementation (to be honest, there is not much in Python code per se, I was just reimplementing Gensim). I believe this is still one of the fastest implementations available, since I generate the walks on-the-fly and store the precomputed walk probabilities efficiently.