ysig / GraKeL

A scikit-learn compatible library for graph kernels
https://ysig.github.io/GraKeL/
Other
587 stars 96 forks source link

Which algorithm can be used when both node and edge are labeled? #62

Closed tiedaoxiaotubie closed 3 years ago

tiedaoxiaotubie commented 3 years ago

Hi, I want to use GraKel to compare program's function call graph similarity, each caller address is a node, I use the caller address itself as the node's label, besides, loop always exist in program execution, thus one edge may be executed multiple times before the program exit, so I use execution time as edge label.

I want to know whether there is an algorithm in GraKel can handle the situation above? I checked the example, but it only shows the graph with labeled node, I also tried to use WeisfeilerLehman algorithm, but it seems this algorithm won't take edge label into consideration :(

giannisnik commented 3 years ago

Hi @tiedaoxiaotubie ,

The NeighborhoodSubgraphPairwiseDistance kernel can take both node and edge labels into account. You can have a look at this example that uses this kernel: https://github.com/ysig/GraKeL/blob/master/tutorials/digit_classification/digit_classification.ipynb

Note that the node and edge labels need to be discrete. You mentioned that you use execution time as edge label. Does that mean that these labels are real-valued? If this is the case, the kernel would not be able to handle them. One easy workaround is to apply some binning approach and map these execution times to integers.

tiedaoxiaotubie commented 3 years ago

Thanks for your answer, that's quite helpful! But I still have 3 questions: 1) what's the different between label and attribute? it seems attribute must be a vector 2) In the link you sent to me, I noticed weisfeiler_lehman doesn't ignore the edge label, but depends on the base_kernel, does it means we can still use weisfeiler_lehman in our situation? 3) how to transform NetworkX edge labeled graphs to GraKeL graphs? In example code, only shows the node labeled situation, I use the following instruction:

G = graph_from_networkx(cfg_all, node_labels_tag='label', edge_labels_tag='execute_counts')

am I right?

giannisnik commented 3 years ago
  1. Labels are discrete and each label is independent from the others. On the other hand, attributes can be one-dimensional or multi-dimensional (i.e., vectors) and they are not independent from each other (for instance, 10.0 is more similar to 8.0 than to 1.0).
  2. The Weisfeiler-Lehman framework expects node-labeled graphs and applies an iterative procedure that updates these node labels. As you mention, it can work on top of other kernels, however, it cannot be applied to graphs that do not contain node labels.
  3. Yes, what you are doing is correct.
tiedaoxiaotubie commented 3 years ago

Neat! I just tried Weisfeiler-Lehman to calculate the similarity, I found no matter I label edge or not, the graph similarity are always same! That's weird.

Besides, I though the the number of iterations in Weisfeiler-Lehman will affect the result monotonic, but when I add the iterations, the precision didn't rise, so the relationship between iterations number and graph similarity is not monotonic?

giannisnik commented 3 years ago

The kernel values are supposed to increase as the number of iterations increase (if not normalized). But it was not clear to me if you exactly meant that. Or do you mean the accuracy (or some other evaluation metric) on the test set of some dataset? Because these two are not related to each other.

tiedaoxiaotubie commented 3 years ago

Oh, I forgot to mention that, in fact the data are normalized when I used :

gk = WeisfeilerLehman(n_iter=1, base_graph_kernel=VertexHistogram, normalize=True)

suppose you have several graphs that are very same, after increasing the number of iterations, will the graph similar metrics become more and more high? for example, the graph similarity may be 70% when you iterate 4 times, if you increate the iterate number from 4 to 10, the graph similarity will always higher than 70%?

17853535785 commented 3 years ago

Hello, I now have a question similar to yours: based on the use of Weisfeiler-Lehman to calculate the similarity of graph structure, I want to use it to calculate the similarity of graphs with weights (edges of the graph). I don't know if it can be implemented directly using GraKel, or can it be changed on this basis?

giannisnik commented 3 years ago

Hi @17853535785 ,

This has been discussed in a previous thread: https://github.com/ysig/GraKeL/issues/37

giannisnik commented 3 years ago

Dear @tiedaoxiaotubie

With regards to your last question, by increasing the number of iterations of the Weisfeiler-Lehman subtree kernel, it is not necessary that the normalized kernel values will keep increasing. Here is a counter-example: suppose we are given a graph G1 that corresponds to the path P5 (path consisting of 5 vertices) and a graph G2 that is the disjoint union of a the cycle C3 (triangle) and P2 (path consisting of 2 vertices). We annotate the vertices of both graphs with the same label (e.g., 1). After the first iteration of the Weisfeiler-Lehman subtree kernel, the label sets of the two graphs are identical, thus the normalized kernel value is equal to 1. At the end of the second interation, the label sets of the two graphs are not identical, thus the normalized kernel value will be smaller than 1.