xgfs / verse

Reference implementation of the paper VERSE: Versatile Graph Embeddings from Similarity Measures
http://tsitsul.in/publications/verse/
MIT License
128 stars 22 forks source link

question on input format #4

Closed jwijffels closed 6 years ago

jwijffels commented 6 years ago

I'm seeing how to build an R wrapper around this. Can you explain a bit the BCSR format in words. What I'm trying to do is to understand what is in the offsets & edges c++ code you have there. Can you explain that in words so that I am 100% sure what they should be if I passed the data from R to C++

offsets = static_cast<int *>(aligned_malloc((nv + 1) * sizeof(int32_t), DEFAULT_ALIGN));
edges = static_cast<int *>(aligned_malloc(ne * sizeof(int32_t), DEFAULT_ALIGN));
embFile.read(reinterpret_cast<char *>(offsets), nv * sizeof(int32_t));
offsets[nv] = (int)ne;
embFile.read(reinterpret_cast<char *>(edges), sizeof(int32_t) * ne);
xgfs commented 6 years ago

I'd happily include/reference an R wrapper.

BCSR is essentially CSR (python reference since I'm not an R user, but the standard is universal). If you have any (unweighted) graph in sparse matrix format, there should be a way to convert it to CSR. Then, 'indices' in the description of the format correspond to 'offsets' in the code, and 'edges' in the code correspond to 'indptr'.

jwijffels commented 6 years ago

Thank you for the input. I'll try to build an R wrapper around this.

jwijffels commented 6 years ago

Are you sure this is CSR format and not CSC format which is implemented? Or that the statement in the answer above should have been "Then, 'indices' in the description of the format correspond to 'edges' in the code, and 'offsets' in the code correspond to 'indptr'."

xgfs commented 6 years ago

"Then, 'indices' in the description of the format correspond to 'edges' in the code, and 'offsets' in the code correspond to 'indptr'."

Yes, sorry.

jwijffels commented 6 years ago

Ok thanks for the confirmation.

jwijffels commented 6 years ago

FYI. I've build the R wrapper which works fine and returns the embeddings. I've put the R package here https://www.datatailor.be/open-source/deepwalker for the time being. I'm stuck however on the 3 different similarity implementations. If I'm correct verse.cpp is pagerank, verse_simrank.cpp is simrank from the paper and verse_neigh.cpp is adjacency similarity. When I try to include all 3 neighbourhood metrics in the R package, R complains because there are multiple definitions for each function - no wonder, the c++ functionality seems to be copied. How can this be solved optimally?

verse.o:verse.cpp:(.text+0xe0): multiple definition of `lrand()'
verse-simrank.o:verse-simrank.cpp:(.text+0xe0): first defined here
verse.o:verse.cpp:(.text+0x120): multiple definition of `init_sigmoid_table()'
verse-simrank.o:verse-simrank.cpp:(.text+0x120): first defined here
verse.o:verse.cpp:(.text+0x200): multiple definition of `FastSigmoid(float)'
verse-simrank.o:verse-simrank.cpp:(.text+0x200): first defined here
verse.o:verse.cpp:(.text+0x250): multiple definition of `Train()'
verse-simrank.o:verse-simrank.cpp:(.text+0x250): first defined here
verse.o:verse.cpp:(.bss+0x0): multiple definition of `rng_seed'
verse-simrank.o:verse-simrank.cpp:(.bss+0x0): first defined here
verse.o:verse.cpp:(.bss+0x10): multiple definition of `sigmoid_table'
verse-simrank.o:verse-simrank.cpp:(.bss+0x10): first defined here
verse.o:verse.cpp:(.bss+0x40): multiple definition of `nv'
verse-simrank.o:verse-simrank.cpp:(.bss+0x40): first defined here
verse.o:verse.cpp:(.data+0x4): multiple definition of `n_samples'
verse-simrank.o:verse-simrank.cpp:(.data+0x4): first defined here
verse.o:verse.cpp:(.bss+0x18): multiple definition of `w0'
verse-simrank.o:verse-simrank.cpp:(.bss+0x18): first defined here
verse.o:verse.cpp:(.bss+0x48): multiple definition of `step'
verse-simrank.o:verse-simrank.cpp:(.bss+0x48): first defined here
verse.o:verse.cpp:(.data+0x8): multiple definition of `n_hidden'
verse-simrank.o:verse-simrank.cpp:(.data+0x8): first defined here
verse.o:verse.cpp:(.bss+0x50): multiple definition of `total_steps'
verse-simrank.o:verse-simrank.cpp:(.bss+0x50): first defined here
verse.o:verse.cpp:(.bss+0x30): multiple definition of `offsets'
verse-simrank.o:verse-simrank.cpp:(.bss+0x30): first defined here
verse.o:verse.cpp:(.bss+0x28): multiple definition of `edges'
verse-simrank.o:verse-simrank.cpp:(.bss+0x28): first defined here
verse.o:verse.cpp:(.data+0x0): multiple definition of `ppralpha'
verse-simrank.o:verse-simrank.cpp:(.data+0x0): first defined here
verse.o:verse.cpp:(.data+0x10): multiple definition of `global_lr'
verse-simrank.o:verse-simrank.cpp:(.data+0x10): first defined here
verse.o:verse.cpp:(.bss+0x58): multiple definition of `silent'
verse-simrank.o:verse-simrank.cpp:(.bss+0x58): first defined here
verse.o:verse.cpp:(.data+0x14): multiple definition of `n_threads'
verse-simrank.o:verse-simrank.cpp:(.data+0x14): first defined here
verse.o:verse.cpp:(.data+0xc): multiple definition of `n_epochs'
verse-simrank.o:verse-simrank.cpp:(.data+0xc): first defined here
verse.o:verse.cpp:(.bss+0x38): multiple definition of `ne'
verse-simrank.o:verse-simrank.cpp:(.bss+0x38): first defined here
verse.o:verse.cpp:(.bss+0x20): multiple definition of `degrees'
verse-simrank.o:verse-simrank.cpp:(.bss+0x20): first defined here
xgfs commented 6 years ago

I will verify the correctness of the wrapper soon, and then would ask you to submit a pull request (or I do it myself), if you are not against it.

As for the inclusion, three programs only have a different Train() function, one can probably refactor the code to include three different cores, and the rest will be all the same.

jwijffels commented 6 years ago

Also these elements have the same name: rng_seed' sigmoid_table' nv' n_samples' w0' step' n_hidden' total_steps' offsets' edges' ppralpha' global_lr' silent' n_threads' n_epochs' ne' `degrees' I wonder how this should be refactored? Any idea?

xgfs commented 6 years ago

I mean, they could be shared, as their implementation is the same, and some logic inside the wrapper could raise a warning if the method is called twice, meaning the embeddings are not randomized.

jwijffels commented 6 years ago

Currently I only could include 1 of the 3 (verse.cpp) as including the other 2 give conflicts (shown above). This might need some refactoring if I would like to also include verse, deepwalk and node2vec in the same R package.

xgfs commented 6 years ago

I'm currently re-writing the code for DeepWalk, n2v and Verse to allow for such library. Please wait a little bit, as the architecture will change a little (i.e. allocations are managed by the Python/R code).

xgfs commented 6 years ago

The reasoning is essentially since it's a research work, people must be able to reproduce it, including the scalability. If I outsource some parts to Python, I'll inevitably lose some speed due to different memory layouts. Code as is is extremely inflexible t obe used as a library.

jwijffels commented 6 years ago

Yes, I understand. Maybe this can be rewritten with classes or templates in order to allow more easily to be used by python/R while still keeping the speed? For what it is worth, the code which I written to wrap it looks like this: https://www.datatailor.be/open-source/deepwalker/blob/master/src/verse.cpp