Not-so-good clustering in experiments

xgfs / verse

Reference implementation of the paper VERSE: Versatile Graph Embeddings from Similarity Measures

http://tsitsul.in/publications/verse/

MIT License


Not-so-good clustering in experiments #19

Open hadisfr opened 3 years ago

hadisfr commented 3 years ago

Hi! I tried to use VERSE to visualize a not-so-large (nv: 23463, ne: 35923), well-clustered graph. I used the PPR version with --dim 2 (Total steps (mil): 2346.3), then used the two dimensions as x and y (after normalization) and pre-calculated cluster IDs (Louvain method) as colours to visualize the embedded graph. I ended up with this: [image: Untitled] I was expecting a visualization in which all clusters are clearly separated, as in the example shown in your article. Any idea which configuration I should use, or what was wrong with my procedure?

hadisfr commented 3 years ago

Using 128 dimensions and then reducing the result to x and y with UMAP, I ended up with this: [image: Figure_1] Is this the right approach? Can I make it better?

xgfs commented 3 years ago

Did you calculate the modularity of the Louvain clustering and of, say, a k-means clustering on the embedding? Are they comparable?

hadisfr commented 3 years ago

No. How can I do that? Feed the final two-dimensional embedding to sklearn or something? 🤔

P.S. I have often seen this approach of feeding higher-dimensional embeddings from VERSE or node2vec into UMAP to get a two-dimensional embedding for visualization, and it seems to work better than using e.g. VERSE to get a two-dimensional embedding directly. But I don't get it. Isn't UMAP just another embedding tool like VERSE and node2vec, only with a different approach?

xgfs commented 3 years ago

Personally, I would feed the 128d embeddings.
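A minimal sketch of the modularity comparison suggested above, in pure Python. Everything in it is an illustrative assumption rather than part of VERSE: the toy graph (two triangles joined by one edge), the toy 2-D points standing in for the embedding, the hard-coded `louvain_labels`, and the helper names `modularity` and `kmeans`. On real data one would more likely run a library k-means (for example scikit-learn's `KMeans`) on the 128d embedding and compare its modularity with that of the Louvain labels.

```python
from collections import defaultdict


def modularity(edges, labels):
    """Newman modularity of a partition of an undirected, unweighted graph.

    edges  : list of (u, v) pairs, no self-loops, each edge listed once
    labels : labels[node] is the community of that node
    """
    m = len(edges)
    intra = defaultdict(int)       # edges with both endpoints in the community
    degree_sum = defaultdict(int)  # sum of node degrees per community
    for u, v in edges:
        degree_sum[labels[u]] += 1
        degree_sum[labels[v]] += 1
        if labels[u] == labels[v]:
            intra[labels[u]] += 1
    return sum(intra[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)


def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iters=20):
    """Tiny Lloyd's k-means with deterministic farthest-point initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist2(p, c) for c in centers)))
    labels = []
    for _ in range(iters):
        # Assign every point to its nearest center, then recompute the centers.
        labels = [min(range(k), key=lambda j: dist2(p, centers[j])) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


# Toy graph (assumption): two triangles joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]

# Toy stand-in for an embedding (assumption): one row per node.
embedding = [(0.0, 0.1), (0.1, 0.0), (0.2, 0.2), (5.0, 5.1), (5.1, 5.0), (4.9, 4.8)]

louvain_labels = [0, 0, 0, 1, 1, 1]   # hypothetical pre-calculated cluster IDs
kmeans_labels = kmeans(embedding, k=2)
scrambled_labels = [0, 1, 2, 0, 1, 2]  # deliberately bad partition as a baseline

q_louvain = modularity(edges, louvain_labels)
q_kmeans = modularity(edges, kmeans_labels)
q_scrambled = modularity(edges, scrambled_labels)

print(f"Louvain   modularity: {q_louvain:.3f}")
print(f"k-means   modularity: {q_kmeans:.3f}")
print(f"scrambled modularity: {q_scrambled:.3f}")
```

If the k-means modularity on the embedding is close to the Louvain modularity, the embedding has preserved the community structure, even when a direct 2-D plot does not show cleanly separated clusters.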

Regarding 2d vs. 128d embeddings: the objective functions of UMAP and t-SNE are tailored to the visualization task. VERSE is a bit different; it offers similarity preservation for the analysis of graphs.

hadisfr commented 3 years ago

I'll test it that way later. 🤔

The difference in how the objective functions are designed is an important point. I have not dug too deeply into UMAP. Thank you!