Closed altsoph closed 8 months ago
Hi @altsoph, thanks for reporting this bug.
So it seems that there is a clear reason why this bug is happening:
def fit_transform(self, X):
    """Fit and transform, on the same dataset.
    Parameters
    ----------
    X : iterable
        Each element must be an iterable with at most three features and at
        least one. The first that is obligatory is a valid graph structure
        (adjacency matrix or edge_dictionary) while the second is
        node_labels and the third edge_labels (that fitting the given graph
        format). If None the kernel matrix is calculated upon fit data.
        The test samples.
    Returns
    -------
    K : numpy array, shape = [n_input_graphs, n_input_graphs]
        corresponding to the kernel matrix, a calculation between
        all pairs of graphs between target an features
    """
    self._method_calling = 2
    self.fit(X)
    S, N = np.zeros(shape=(self._ngx, self._ngx)), dict()
    for (key, M) in iteritems(self.X):
        K = M.dot(M.T).toarray()
        K_diag = K.diagonal()
        N[key] = K_diag
        # --> nan entries (0 / 0) are replaced with 0 here instead of being propagated
        S += np.nan_to_num(K / np.sqrt(np.outer(K_diag, K_diag)))
    self._X_level_norm_factor = N
    if self.normalize:
        # --> meaningless if a level adds zero for one element and nonzero for the others
        return S / len(self.X)
    else:
        return S
If you run:
nspd = NeighborhoodSubgraphPairwiseDistance(normalize=True)
nspd.fit([g1, g2])
S, N = np.zeros(shape=(nspd._ngx, nspd._ngx)), dict()
for (key, M) in iteritems(nspd.X):
    K = M.dot(M.T).toarray()
    K_diag = K.diagonal()
    N[key] = K_diag
    S += np.nan_to_num(np.sqrt(np.outer(K_diag, K_diag))) > 0
print(S)
you get:
/usr/local/lib/python3.10/dist-packages/grakel/kernels/neighborhood_subgraph_pairwise_distance.py:314: RuntimeWarning: invalid value encountered in divide
S += np.nan_to_num(K / np.sqrt(np.outer(K_diag, K_diag)))
array([[16., 12.],
[12., 12.]])
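This is what skews the results: len(nspd.X) == 16, but g2 receives a nonzero contribution from only 12 of those 16 levels, so its diagonal entry ends up as 12 / 16 = 0.75 instead of 1.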
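The averaging effect can be reproduced without GraKeL. The sketch below uses made-up per-level diagonals (the values 4, 1, 9 and 0 are assumptions, chosen only so that the second graph is all-zero in 4 of 16 levels) and applies the same `nan_to_num` normalization as the library code quoted above:

```python
import numpy as np

# Toy stand-in for the per-level kernel diagonals (hypothetical values):
# 12 levels where both graphs have features, 4 where the second graph has none.
diags = [np.array([4.0, 1.0])] * 12 + [np.array([9.0, 0.0])] * 4
n_levels = len(diags)  # 16, like len(nspd.X) in the thread

S = np.zeros((2, 2))
for K_diag in diags:
    # Rank-one toy kernel whose diagonal is K_diag
    K = np.outer(np.sqrt(K_diag), np.sqrt(K_diag))
    with np.errstate(invalid="ignore", divide="ignore"):
        # 0 / 0 gives nan, which nan_to_num silently turns into 0
        S += np.nan_to_num(K / np.sqrt(np.outer(K_diag, K_diag)))

result = S / n_levels
print(S)       # per-entry count of levels that contributed
print(result)  # the second diagonal entry falls below 1
```

Because the sum is divided by the total number of levels rather than by the number of levels in which an entry is defined, any graph with an empty level gets a self-similarity below 1.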
Moreover, if you print np.sqrt(np.outer(K_diag, K_diag)) for each level, the zeros that lead to the nan values are clearly visible in the last entries:
print(np.sqrt(np.outer(K_diag, K_diag)))
0.1.8
[[1. 0.22577328]
[0.22577328 0.75 ]]
[[36. 36.]
[36. 36.]]
[[ 8. 9.79795897]
[ 9.79795897 12. ]]
[[ 8. 12.64911064]
[12.64911064 20. ]]
[[36. 26.83281573]
[26.83281573 20. ]]
[[144. 144.]
[144. 144.]]
[[18. 26.83281573]
[26.83281573 40. ]]
[[ 18. 43.26661531]
[ 43.26661531 104. ]]
[[144. 122.37646833]
[122.37646833 104. ]]
[[64. 16.]
[16. 4.]]
[[12. 6.92820323]
[ 6.92820323 4. ]]
[[12. 6.92820323]
[ 6.92820323 4. ]]
[[64. 16.]
[16. 4.]]
[[100. 0.]
[ 0. 0.]]
[[18. 0.]
[ 0. 0.]]
[[18. 0.]
[ 0. 0.]]
[[100. 0.]
[ 0. 0.]]
GraKeL was built in 2017 (before the deep-learning era, and it even ran on Python 2). At the time, our tests showed it working.
It is a bit odd for a vector to produce nan when compared with itself, so I kindly ask @giannisnik to examine what a principled way to resolve this normalization issue for the nan cases would be. We will both reply here and push the result as a bugfix release (0.1.10). We also need to correct the version number, because it is 0.1.9.
Thanks a lot!
Hi @altsoph,
The issue is due to the default value of the kernel's hyperparameter d (the default is 4). If you set d=2, no problem occurs. The reason behind the problem is that the kernel iteratively looks for pairs of nodes that are at distance $\delta$ for $\delta \in \{1, \ldots, d\}$. The second graph you created (i.e., g2) consists of two connected components. The first is just a pair of nodes connected by an edge. The second component contains 4 nodes, and the maximum shortest-path distance between any two of those nodes is 2. Thus, for d=3 and d=4, the kernel finds no pair of nodes from g2 that satisfies the distance constraint. This is why the kernel value for d=3 and d=4 is equal to 0, which leads to nan values.
@ysig I think this can be fixed just by replacing nan on the diagonal of K / np.sqrt(np.outer(K_diag, K_diag)) with a 1 instead of a 0.
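A minimal sketch of that suggestion, written as a standalone helper. The name `normalized_level` is hypothetical and not part of GraKeL, and this is not necessarily the fix that was eventually released:

```python
import numpy as np

def normalized_level(K):
    """Normalize one level's kernel matrix; undefined diagonal entries become 1."""
    K_diag = K.diagonal()
    with np.errstate(invalid="ignore", divide="ignore"):
        Kn = K / np.sqrt(np.outer(K_diag, K_diag))
    # Remember which diagonal entries were 0 / 0 before nan is cleared
    idx = np.flatnonzero(np.isnan(np.diagonal(Kn)))
    Kn = np.nan_to_num(Kn)  # off-diagonal nan still becomes 0
    Kn[idx, idx] = 1.0      # a graph with no features at this level matches itself
    return Kn

empty_level = normalized_level(np.array([[9.0, 0.0], [0.0, 0.0]]))
print(empty_level)
```

With this change every level contributes exactly 1 to each diagonal entry, so dividing the sum by the number of levels yields a unit diagonal.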
Hi @giannisnik,
Thanks for the clarification!
However, even with d=2 NeighborhoodSubgraphPairwiseDistance is having trouble with some disconnected graphs. Here is another example with an isolated node:
g1 = Graph([[0., 0., 0.,],
            [0., 0., 1.,],
            [0., 1., 0.,]],
           node_labels=[1,1,1],
           edge_labels={(1, 2): 'B',
                        (2, 1): 'B',})
g2 = Graph([[0., 1., 1.,],
            [1., 0., 0.,],
            [1., 0., 0.,]],
           node_labels=[1,1,1],
           edge_labels={(0, 1): 'B',
                        (0, 2): 'B',
                        (1, 0): 'B',
                        (2, 0): 'B',})
print( NeighborhoodSubgraphPairwiseDistance(
    normalize=True, d=2
).fit_transform([g1, g2]) )
The output is:
[[0.66666667 0.23333333]
[0.23333333 1. ]]
Perhaps the problem appears whenever a graph has a component whose radius is less than d.
@altsoph Yes, that's exactly when it appears. As @ysig said, we will fix this in the next version.
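To see which distances each graph can actually supply, a plain breadth-first search over the two adjacency matrices from the example above is enough. This is an illustrative sketch in pure Python; `realized_distances` is a hypothetical helper, not a GraKeL function:

```python
from collections import deque

def realized_distances(adj):
    """Return the set of shortest-path distances (>= 1) between connected node pairs."""
    n = len(adj)
    found = set()
    for source in range(n):
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in range(n):
                if adj[u][v] and v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        found.update(d for d in dist.values() if d > 0)
    return found

# Adjacency matrices of g1 and g2 from the d=2 example
g1_adj = [[0, 0, 0], [0, 0, 1], [0, 1, 0]]
g2_adj = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
d1 = realized_distances(g1_adj)
d2 = realized_distances(g2_adj)
print(d1)  # {1}
print(d2)  # {1, 2}
```

g1 has no pair of nodes at distance 2, so with d=2 it has at least one empty level, which is consistent with its diagonal entry dropping below 1 while the entry for g2 stays at 1.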
Describe the bug
NeighborhoodSubgraphPairwiseDistance kernel returns diagonal elements less than 1.
To Reproduce
Small snippet to reproduce:
The output is:
Expected behavior
Both elements on the diagonal of the resulting matrix should be equal to 1.