NeighborhoodSubgraphPairwiseDistance kernel returns diagonal elements less than 1

altsoph commented 10 months ago

Describe the bug

NeighborhoodSubgraphPairwiseDistance kernel returns diagonal elements less than 1.

To Reproduce

Small snippet to reproduce:

import grakel
from grakel import Graph
from grakel.kernels import NeighborhoodSubgraphPairwiseDistance

print(grakel.__version__)

g1 = Graph([[0., 0., 0., 0., 1., 1.],
            [0., 0., 1., 1., 0., 0.],
            [0., 1., 0., 1., 0., 0.],
            [0., 1., 1., 0., 1., 0.],
            [1., 0., 0., 1., 0., 0.],
            [1., 0., 0., 0., 0., 0.]], 
           node_labels=[1,1,1,1,1,1],
           edge_labels={(0, 4): 'B',
                        (0, 5): 'B',
                        (1, 2): 'B',
                        (1, 3): 'B',
                        (2, 1): 'B',
                        (2, 3): 'B',
                        (3, 1): 'B',
                        (3, 2): 'B',
                        (3, 4): 'B',
                        (4, 0): 'B',
                        (4, 3): 'B',
                        (5, 0): 'B'})

g2 = Graph([[0., 0., 0., 1., 1., 1.],
            [0., 0., 1., 0., 0., 0.],
            [0., 1., 0., 0., 0., 0.],
            [1., 0., 0., 0., 0., 1.],
            [1., 0., 0., 0., 0., 1.],
            [1., 0., 0., 1., 1., 0.]],
           node_labels=[1,1,1,1,1,1],
           edge_labels={(0, 3): 'B',
                        (0, 4): 'B',
                        (0, 5): 'B',
                        (1, 2): 'B',
                        (2, 1): 'B',
                        (3, 0): 'B',
                        (3, 5): 'B',
                        (4, 0): 'B',
                        (4, 5): 'B',
                        (5, 0): 'B',
                        (5, 3): 'B',
                        (5, 4): 'B'})

print( NeighborhoodSubgraphPairwiseDistance(
          normalize=True
      ).fit_transform([g1, g2]) )

The output is:

0.1.8
[[1.         0.22577328]
 [0.22577328 0.75      ]]

Expected behavior Both elements on the diagonal of the resulting matrix should be equal to 1.

ysig commented 10 months ago

Hi @altsoph, thanks for reporting this bug.

So it seems that there is a clear reason why this bug is happening:

    def fit_transform(self, X):
        """Fit and transform, on the same dataset.

        Parameters
        ----------
        X : iterable
            Each element must be an iterable with at most three features and at
            least one. The first that is obligatory is a valid graph structure
            (adjacency matrix or edge_dictionary) while the second is
            node_labels and the third edge_labels (that fitting the given graph
            format). If None the kernel matrix is calculated upon fit data.
            The test samples.

        Returns
        -------
        K : numpy array, shape = [n_input_graphs, n_input_graphs]
            corresponding to the kernel matrix, a calculation between
            all pairs of graphs between target an features

        """
        self._method_calling = 2
        self.fit(X)

        S, N = np.zeros(shape=(self._ngx, self._ngx)), dict()
        for (key, M) in iteritems(self.X):
            K = M.dot(M.T).toarray()
            K_diag = K.diagonal()
            N[key] = K_diag
-->       S += np.nan_to_num(K / np.sqrt(np.outer(K_diag, K_diag)))
            # here things that are num are not propagated

        self._X_level_norm_factor = N

        if self.normalize:
-->       return S / len(self.X) # this is meaningless if for one element adds zero and for others it doesn't
        else:
            return S

If you run:

nspd = NeighborhoodSubgraphPairwiseDistance(normalize=True)
nspd.fit([g1, g2])
S, N = np.zeros(shape=(nspd._ngx, nspd._ngx)), dict()
for (key, M) in iteritems(nspd.X):
    K = M.dot(M.T).toarray()
    K_diag = K.diagonal()
    N[key] = K_diag
    S += np.nan_to_num(np.sqrt(np.outer(K_diag, K_diag))) > 0

print(S)

you get:

/usr/local/lib/python3.10/dist-packages/grakel/kernels/neighborhood_subgraph_pairwise_distance.py:314: RuntimeWarning: invalid value encountered in divide
  S += np.nan_to_num(K / np.sqrt(np.outer(K_diag, K_diag)))
array([[16., 12.],
       [12., 12.]])

which is what is skewing the results as len(X) == 16.

More over if you print the K_diag the nan is clearly visible on the last entries: print(np.sqrt(np.outer(K_diag, K_diag)))

0.1.8
[[1.         0.22577328]
 [0.22577328 0.75      ]]
[[36. 36.]
 [36. 36.]]
[[ 8.          9.79795897]
 [ 9.79795897 12.        ]]
[[ 8.         12.64911064]
 [12.64911064 20.        ]]
[[36.         26.83281573]
 [26.83281573 20.        ]]
[[144. 144.]
 [144. 144.]]
[[18.         26.83281573]
 [26.83281573 40.        ]]
[[ 18.          43.26661531]
 [ 43.26661531 104.        ]]
[[144.         122.37646833]
 [122.37646833 104.        ]]
[[64. 16.]
 [16.  4.]]
[[12.          6.92820323]
 [ 6.92820323  4.        ]]
[[12.          6.92820323]
 [ 6.92820323  4.        ]]
[[64. 16.]
 [16.  4.]]
[[100.   0.]
 [  0.   0.]]
[[18.  0.]
 [ 0.  0.]]
[[18.  0.]
 [ 0.  0.]]
[[100.   0.]
 [  0.   0.]]

Grakel was built at 2017 (before even deep-learning and was working even with python-2). At the time we tested it was working.

For a vector to have nan with itself is a bit weird result, so I kindly ask @giannisnik to do some further examination on what is a principled way to resolve this normalization issue for the nan cases and we will both reply here and push it as a bugfix to the version (0.1.10) - (need to fix having the correct version number too because it's 0.1.9).

altsoph commented 10 months ago

Thanks a lot!

giannisnik commented 10 months ago

Hi @altsoph ,

The issue is due to the default value of hyperparameter d of the kernel (default value is equal to 4). If you set d=2, no problem occurs. The reason behind the problem is that the kernel iteratively looks for pairs of nodes that are at distance $\delta$ for $\delta \in { 1, \ldots, d}$. The second graph you created (i.e., g2) consists of two connected components. The first is just a pair of nodes connected by an edge. The second component contains 4 nodes and the maximum shortest path distance between any two of those nodes is 2. Thus, for d=3 and d=4, the kernel finds no pair of nodes from g2 that satisfy the distance constraint and this is why kernel value for d=3 and d=4 is equal to 0 which leads to nan values.

@ysig I think this can be fixed just by replacing nanon the diagonal of K / np.sqrt(np.outer(K_diag, K_diag) with an 1 instead of a 0.

altsoph commented 10 months ago

Hi @giannisnik,

Thanks for clarifications!

However, even with d=2 NeighborhoodSubgraphPairwiseDistance is having trouble with some disconnected graphs. Here is another example with an isolated node:

g1 = Graph([[0., 0., 0.,],
            [0., 0., 1.,],
            [0., 1., 0.,]],
           node_labels=[1,1,1],
           edge_labels={(1, 2): 'B',
                        (2, 1): 'B',})

g2 = Graph([[0., 1., 1.,],
            [1., 0., 0.,],
            [1., 0., 0.,]],
           node_labels=[1,1,1],
           edge_labels={(0, 1): 'B',
                        (0, 2): 'B',
                        (1, 0): 'B',
                        (2, 0): 'B',})

print( NeighborhoodSubgraphPairwiseDistance(
          normalize=True, d=2
      ).fit_transform([g1, g2]) )

The output is:

[[0.66666667 0.23333333]
 [0.23333333 1.        ]]

Perhaps, the problem appears when there is a component with the radius less than d.

giannisnik commented 10 months ago

@altsoph Yes, that's exactly when it appears. As @ysig said, we will fix this in the next version.

ysig / GraKeL

NeighborhoodSubgraphPairwiseDistance kernel returns diagonal elements less than 1 #94