ysig / GraKeL

A scikit-learn compatible library for graph kernels
https://ysig.github.io/GraKeL/

NaN error when using Random walk kernel on certain datasets #81

Closed amanuelanteneh closed 2 years ago

amanuelanteneh commented 2 years ago

Describe the bug: When using the GraKeL implementation of the random walk kernel on the PTC_FM dataset, I get the following error: ValueError: Input contains NaN, infinity or a value too large for dtype('float64')

To Reproduce Steps to reproduce the behavior:

from grakel.datasets import fetch_dataset
from sklearn.model_selection import train_test_split
from grakel import GraphKernel
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

dataset = fetch_dataset("PTC_FM", verbose=False)
G = dataset.data
y = dataset.target
knl = {"name": 'random_walk', "with_labels": False, "lamda": 0.1, "method_type": 'fast', "kernel_type": 'geometric'}

kernel = GraphKernel(kernel=knl, normalize=True)
G_train, G_test, y_train, y_test = train_test_split(G, y, test_size=0.1, random_state=42)
K_train = kernel.fit_transform(G_train)
K_test = kernel.transform(G_test)
clf = SVC(kernel='precomputed')
clf.fit(K_train, y_train)
SVC(kernel='precomputed')
y_pred = clf.predict(K_test)

print("%2.2f %%" %(round(accuracy_score(y_test, y_pred)*100)))

Expected behavior: I expect the code to print the classification accuracy.

Stack Trace

ValueError                                Traceback (most recent call last)

in
     22
     23 clf = SVC(kernel='precomputed')
---> 24 clf.fit(K_train, y_train)
     25 SVC(kernel='precomputed')
     26 y_pred = clf.predict(K_test)

4 frames

/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py in _assert_all_finite(X, allow_nan, msg_dtype)
    114             raise ValueError(
    115                 msg_err.format(
--> 116                     type_err, msg_dtype if msg_dtype is not None else X.dtype
    117                 )
    118             )
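
The traceback shows scikit-learn rejecting the matrix passed to clf.fit, so the non-finite values are already present in K_train before the SVC sees it. A minimal sketch for confirming that, using only NumPy; the helper name report_non_finite is hypothetical and not part of GraKeL or scikit-learn:

```python
import numpy as np


def report_non_finite(K):
    """Summarize NaN/inf entries in a precomputed kernel matrix.

    Hypothetical helper for illustration; not a GraKeL or scikit-learn function.
    """
    K = np.asarray(K, dtype=float)
    bad = ~np.isfinite(K)  # True wherever the entry is NaN or +/-inf
    rows = np.unique(np.nonzero(bad)[0]).tolist()  # rows with at least one bad entry
    return {"n_non_finite": int(bad.sum()), "rows": rows}


if __name__ == "__main__":
    # Toy matrix standing in for K_train; the NaN is planted on purpose.
    K_demo = np.array([[1.0, np.nan], [0.5, 1.0]])
    print(report_non_finite(K_demo))
```

Calling report_non_finite(K_train) right after kernel.fit_transform(G_train) would show whether the kernel computation itself produced the bad entries and which graphs (rows) are involved.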

ysig commented 2 years ago

Normally it should. It could also be a parameter issue. @giannisnik, can you have a look?

giannisnik commented 2 years ago

Hi @amanuelanteneh, this is caused by the value of the hyperparameter lamda. If you set it to 0.001, no NaN values emerge and classification is performed successfully. The geometric random walk kernel sums a geometric series, and that series only converges when lamda is small enough. See the original paper for more details.
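
The convergence condition can be illustrated with plain NumPy. The sketch below does not reproduce GraKeL's internals; it assumes the textbook unlabeled formulation, in which the kernel sums lamda^k times the number of length-k walks in the direct product graph, whose adjacency matrix is the Kronecker product of the two input adjacency matrices. That series converges only when lamda times the spectral radius of the product matrix is below 1. The function names are hypothetical.

```python
import numpy as np


def spectral_radius(M):
    """Largest absolute eigenvalue of a square matrix."""
    return float(np.max(np.abs(np.linalg.eigvals(M))))


def geometric_walk_sum(A1, A2, lam, n_terms=200):
    """Partial sum of sum_k lam^k * 1^T Wx^k 1, with Wx = kron(A1, A2).

    Illustrative only: a truncated series, not GraKeL's 'fast' method.
    """
    Wx = np.kron(A1, A2)          # adjacency of the direct product graph
    ones = np.ones(Wx.shape[0])
    vec = ones.copy()             # holds (lam * Wx)^k @ ones
    total = 0.0
    for _ in range(n_terms):
        total += float(ones @ vec)
        vec = lam * (Wx @ vec)
    return total


if __name__ == "__main__":
    # Two triangles (complete graphs on 3 nodes) as toy inputs.
    A = np.ones((3, 3)) - np.eye(3)
    rho = spectral_radius(np.kron(A, A))
    print("spectral radius:", rho, "-> need lamda <", 1.0 / rho)
    for lam in (0.1, 0.3):
        print(lam, geometric_walk_sum(A, A, lam, n_terms=100))
```

For these toy graphs the threshold is 0.25, so lamda = 0.1 gives a finite sum while lamda = 0.3 makes the partial sums grow without bound. Larger or denser graphs have a larger spectral radius and therefore need a smaller lamda, which is consistent with 0.001 working on PTC_FM where 0.1 did not.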

amanuelanteneh commented 2 years ago

Hi @giannisnik and @ysig, Thank you for the reply. This has fixed the issue.