ysig / GraKeL

A scikit-learn compatible library for graph kernels
https://ysig.github.io/GraKeL/
Other
588 stars 96 forks source link

calculate graph similarity by using GraKel #59

Closed tiedaoxiaotubie closed 3 years ago

tiedaoxiaotubie commented 3 years ago

Hi, I wonder if we can use GraKel to calculate graph similarity only? I don't need to classify the graphs, I only want to get the similarity among them, I have generated graphs with networkX, since multiple strategies of graph kernel have been implemented in GraKel, I think it should have the ability to export the graph similarity, but sadly I failed to find such an API.

giannisnik commented 3 years ago

Hi @tiedaoxiaotubie ,

Assuming that the nodes of your graphs are not annotated with attributes, you can transform them into grakel graphs as follows:

from grakel.utils import graph_from_networkx
G = graph_from_networkx(G_nx)

where G_nx is a list of networkx graphs.

Then, you can compute the n x n kernel matrix (where n is the number of graphs). You can think of the normalized kernel values as the similarities between graphs. Thus, the (i,j)-th element of the kernel matrix corresponds to the similarity between the i-th graph and the j-th graph. To produce the kernel matrix you can use one of the kernels that is implemented in grakel. For instance, you can compute the shortest path kernel as follows:

from grakel.kernels import ShortestPath
gk = ShortestPath(normalize=True)
K = gk.fit_transform(G)

Then, the (i,j)-th element of K contains the similarity between graphs i and j.

tiedaoxiaotubie commented 3 years ago

Thank you very much for your reply! the nodes of my graphs are neither annotated with attributes nor with labels, I tried with your code, but seems not work for me. I have to admit that my graph is pretty large, in which the edge number is more than 6,000, maybe the GraKel cannot handle such a big graph in one time?

I tried to use ShortestPath first, it raised an error immediately:

(base) symexe@ubuntu1804:~/CFGFactory$ python3 CFGenerator.py
Traceback (most recent call last):
  File "CFGenerator.py", line 68, in <module>
    main()
  File "CFGenerator.py", line 62, in main
    C._cfg_similarity()
  File "CFGenerator.py", line 44, in _cfg_similarity
    K = gk.fit_transform(G)
  File "/home/symexe/miniconda3/lib/python3.7/site-packages/grakel/kernels/kernel.py", line 197, in fit_transform
    km = self._calculate_kernel_matrix()
  File "/home/symexe/miniconda3/lib/python3.7/site-packages/grakel/kernels/kernel.py", line 231, in _calculate_kernel_matrix
    K[i, i] = self.pairwise_operation(x, x)
  File "/home/symexe/miniconda3/lib/python3.7/site-packages/grakel/kernels/random_walk.py", line 206, in pairwise_operation
    XY = np.kron(X, Y)
  File "<__array_function__ internals>", line 6, in kron
  File "/home/symexe/miniconda3/lib/python3.7/site-packages/numpy/lib/shape_base.py", line 1154, in kron
    result = outer(a, b).reshape(as_+bs)
  File "<__array_function__ internals>", line 6, in outer
  File "/home/symexe/miniconda3/lib/python3.7/site-packages/numpy/core/numeric.py", line 942, in outer
    return multiply(a.ravel()[:, newaxis], b.ravel()[newaxis, :], out)
numpy.core._exceptions.MemoryError: Unable to allocate 8.83 PiB for an array with shape (35259844, 35259844) and data type float64

I searched the previous issue that similar to my question, I noticed there were people tried to use RandomWalk to calculate the similarity:https://github.com/ysig/GraKeL/issues/8, so I also gave it a try, my code are as follows:

    def _cfg_similarity(self):
        # Transforms list of NetworkX graphs into a list of GraKeL graphs
        G = graph_from_networkx(self._cfg_all)

        # Uses the shortest path kernel to generate the kernel matrices
        gk = RandomWalk(method_type="baseline", lamda=0.01)
        K = gk.fit_transform(G)
        print(K)

at this time, my error message was:

(base) symexe@ubuntu1804:~/CFGFactory$ python3 CFGenerator.py
Traceback (most recent call last):
  File "CFGenerator.py", line 68, in <module>
    main()
  File "CFGenerator.py", line 62, in main
    C._cfg_similarity()
  File "CFGenerator.py", line 44, in _cfg_similarity
    K = gk.fit_transform(G)
  File "/home/symexe/miniconda3/lib/python3.7/site-packages/grakel/kernels/kernel.py", line 197, in fit_transform
    km = self._calculate_kernel_matrix()
  File "/home/symexe/miniconda3/lib/python3.7/site-packages/grakel/kernels/kernel.py", line 231, in _calculate_kernel_matrix
    K[i, i] = self.pairwise_operation(x, x)
  File "/home/symexe/miniconda3/lib/python3.7/site-packages/grakel/kernels/random_walk.py", line 206, in pairwise_operation
    XY = np.kron(X, Y)
  File "<__array_function__ internals>", line 6, in kron
  File "/home/symexe/miniconda3/lib/python3.7/site-packages/numpy/lib/shape_base.py", line 1154, in kron
    result = outer(a, b).reshape(as_+bs)
  File "<__array_function__ internals>", line 6, in outer
  File "/home/symexe/miniconda3/lib/python3.7/site-packages/numpy/core/numeric.py", line 942, in outer
    return multiply(a.ravel()[:, newaxis], b.ravel()[newaxis, :], out)
numpy.core._exceptions.MemoryError: Unable to allocate 8.83 PiB for an array with shape (35259844, 35259844) and data type float64
giannisnik commented 3 years ago

It seems that not only the graphs are large, but also the dataset itself (35,259,844 graphs in total?). Comparing every pair of graphs is not feasible since it requires generating a 35,259,844 x 35,259,844 matrix which does not fit in memory. Do you need to compare each one of the 35,259,844 graphs against every other of the dataset?

tiedaoxiaotubie commented 3 years ago

OK, I changed the number of graphs to be calculated, only select 4 of the original graphs, but there are still more than 6,000 edges and 5,000 nodes for each of the graphs, at this time, the RandomWalk worked, while the ShortestPath still failed. The error message of ShortestPath are as follows:

(base) symexe@ubuntu1804:~/CFGFactory$ python3 CFGenerator.py
Traceback (most recent call last):
  File "CFGenerator.py", line 68, in <module>
    main()
  File "CFGenerator.py", line 62, in main
    C._cfg_similarity()
  File "CFGenerator.py", line 44, in _cfg_similarity
    K = gk.fit_transform(G)
  File "/home/symexe/miniconda3/lib/python3.7/site-packages/grakel/kernels/shortest_path.py", line 393, in fit_transform
    self.fit(X)
  File "/home/symexe/miniconda3/lib/python3.7/site-packages/grakel/kernels/kernel.py", line 123, in fit
    self.X = self.parse_input(X)
  File "/home/symexe/miniconda3/lib/python3.7/site-packages/grakel/kernels/shortest_path.py", line 460, in parse_input
    labels=self._lt)
  File "/home/symexe/miniconda3/lib/python3.7/site-packages/grakel/graph.py", line 688, in build_shortest_path_matrix
    return shortest_path_mat, self.get_labels()
  File "/home/symexe/miniconda3/lib/python3.7/site-packages/grakel/graph.py", line 737, in get_labels
    raise ValueError('Graph does not have any labels for vertices.')
ValueError: Graph does not have any labels for vertices.

It seems trying to get labels from the graph, but the graphs of mine are neither annotated with attributes nor with labels, which makes me confused.

The output of RandomWalk are as follows:

(base) symexe@ubuntu1804:~/CFGFactory$ python3 CFGenerator.py
/home/symexe/miniconda3/lib/python3.7/site-packages/grakel/kernels/kernel.py:201: RuntimeWarning: invalid value encountered in sqrt
  return km / np.sqrt(np.outer(self._X_diag, self._X_diag))
[[-1.                 nan         nan         nan]
 [        nan  1.          1.07683489  0.89696687]
 [        nan  1.07683489  1.          1.09279374]
 [        nan  0.89696687  1.09279374  1.        ]]

For the output of RandomWalk, I have 2 questions: 1) why there are -1 and nan? 2) I noticed the GPU was not used when during the script's execution, is it normal? I thought it should use GPU at least :)

Hope I could get your answer, thank you very much :)

giannisnik commented 3 years ago

With regards to the two questions:

  1. I think that this is related to the lamda parameter of the kernel. Probably, you need to set this to some smaller value than the default. To obtain values between 0 and 1, set also normalize=True.
  2. The code does not use the GPU. The kernels run on the CPU.

With regards to the shortest path kernel, it expects node-labeled graphs. You can set the with_labels attribute to False as follows: gk = ShortestPath(with_labels=False, normalize=True)

Perhaps the most efficient kernel is the WL subtree. But this kernel expects the graphs to contain node labels. To run the kernel, you can assign the same label to all nodes of all graphs (for instance, the label 'a'). To do this, have a look at this example: https://ysig.github.io/GraKeL/0.1a8/auto_examples/nx_to_grakel.html#sphx-glr-auto-examples-nx-to-grakel-py

tiedaoxiaotubie commented 3 years ago

Perfect answer, awesome! I got better result after using a smaller lamda in RandomWalk. the error of shortest path kernel also was handled after adding with_labels=False.

tiedaoxiaotubie commented 1 year ago

您的邮件我已经收到了,我会尽快处理然后给你回复。谢谢!