ysig / GraKeL

A scikit-learn compatible library for graph kernels
https://ysig.github.io/GraKeL/

Edge labels not working for custom dataset #109

Open j-adamczyk opened 4 months ago

j-adamczyk commented 4 months ago

Describe the bug: I'm trying to create a custom dataset for GraKeL:

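import grakel
import networkx as nx
from grakel.utils import graph_from_networkx
from rdkit.Chem import MolFromSmiles


def smiles_to_grakel_graphs(smiles_list: list[str]) -> list[grakel.Graph]: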
    """
    Transforms list of SMILES strings into list of graphs in GraKeL library format.

    We use atomic numbers as discrete node labels and bond types as discrete edge labels.
    """
    mols = [MolFromSmiles(smiles) for smiles in smiles_list]
    graphs = []

    bond_type_to_int = {
        "SINGLE": 1,
        "DOUBLE": 2,
        "TRIPLE": 3,
        "AROMATIC": 4,
    }

    for mol in mols:
        graph = nx.Graph()

        for atom in mol.GetAtoms():
            graph.add_node(atom.GetIdx(), atom_label=atom.GetAtomicNum())

        for bond in mol.GetBonds():
            # any bond type not listed above maps to 5 (OTHER)
            bond_type = bond_type_to_int.get(str(bond.GetBondType()), 5)
            graph.add_edge(
                bond.GetBeginAtomIdx(), bond.GetEndAtomIdx(), bond_label=bond_type
            )

        graphs.append(graph)

    graphs = list(
        graph_from_networkx(
            graphs, as_Graph=True, node_labels_tag="atom_label", edge_labels_tag="bond_label"
        )
    )
    return graphs

This should result in graphs with edge labels. However, later in cross-validation, I get:

/home/jakub/.cache/pypoetry/virtualenvs/pesticide-bee-toxicity-prediction-Sj4YDJPR-py3.11/lib/python3.11/site-packages/grakel/graph.py:314: UserWarning: changing format from "adjacency" to "all"
  warnings.warn('changing format from "adjacency" to "all"')
Traceback (most recent call last):
  File "/home/jakub/PycharmProjects/pesticide_bee_toxicity_prediction/src/graph_kernels.py", line 155, in <module>
    train_graph_kernel_SVM(
  File "/home/jakub/PycharmProjects/pesticide_bee_toxicity_prediction/src/graph_kernels.py", line 128, in train_graph_kernel_SVM
    model.fit(graphs_train, y_train)
  File "/home/jakub/.cache/pypoetry/virtualenvs/pesticide-bee-toxicity-prediction-Sj4YDJPR-py3.11/lib/python3.11/site-packages/sklearn/base.py", line 1474, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jakub/.cache/pypoetry/virtualenvs/pesticide-bee-toxicity-prediction-Sj4YDJPR-py3.11/lib/python3.11/site-packages/sklearn/model_selection/_search.py", line 970, in fit
    self._run_search(evaluate_candidates)
  File "/home/jakub/.cache/pypoetry/virtualenvs/pesticide-bee-toxicity-prediction-Sj4YDJPR-py3.11/lib/python3.11/site-packages/sklearn/model_selection/_search.py", line 1527, in _run_search
    evaluate_candidates(ParameterGrid(self.param_grid))
  File "/home/jakub/.cache/pypoetry/virtualenvs/pesticide-bee-toxicity-prediction-Sj4YDJPR-py3.11/lib/python3.11/site-packages/sklearn/model_selection/_search.py", line 947, in evaluate_candidates
    _warn_or_raise_about_fit_failures(out, self.error_score)
  File "/home/jakub/.cache/pypoetry/virtualenvs/pesticide-bee-toxicity-prediction-Sj4YDJPR-py3.11/lib/python3.11/site-packages/sklearn/model_selection/_validation.py", line 536, in _warn_or_raise_about_fit_failures
    raise ValueError(all_fits_failed_message)
ValueError: 
All the 25 fits failed.
It is very likely that your model is misconfigured.
You can try to debug the error by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
25 fits failed with the following error:
Traceback (most recent call last):
  File "/home/jakub/.cache/pypoetry/virtualenvs/pesticide-bee-toxicity-prediction-Sj4YDJPR-py3.11/lib/python3.11/site-packages/sklearn/model_selection/_validation.py", line 895, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/home/jakub/.cache/pypoetry/virtualenvs/pesticide-bee-toxicity-prediction-Sj4YDJPR-py3.11/lib/python3.11/site-packages/sklearn/base.py", line 1474, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jakub/.cache/pypoetry/virtualenvs/pesticide-bee-toxicity-prediction-Sj4YDJPR-py3.11/lib/python3.11/site-packages/sklearn/pipeline.py", line 471, in fit
    Xt = self._fit(X, y, routed_params)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jakub/.cache/pypoetry/virtualenvs/pesticide-bee-toxicity-prediction-Sj4YDJPR-py3.11/lib/python3.11/site-packages/sklearn/pipeline.py", line 408, in _fit
    X, fitted_transformer = fit_transform_one_cached(
                            ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jakub/.cache/pypoetry/virtualenvs/pesticide-bee-toxicity-prediction-Sj4YDJPR-py3.11/lib/python3.11/site-packages/joblib/memory.py", line 312, in __call__
    return self.func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jakub/.cache/pypoetry/virtualenvs/pesticide-bee-toxicity-prediction-Sj4YDJPR-py3.11/lib/python3.11/site-packages/sklearn/pipeline.py", line 1303, in _fit_transform_one
    res = transformer.fit_transform(X, y, **params.get("fit_transform", {}))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jakub/.cache/pypoetry/virtualenvs/pesticide-bee-toxicity-prediction-Sj4YDJPR-py3.11/lib/python3.11/site-packages/sklearn/utils/_set_output.py", line 295, in wrapped
    data_to_wrap = f(self, X, *args, **kwargs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jakub/.cache/pypoetry/virtualenvs/pesticide-bee-toxicity-prediction-Sj4YDJPR-py3.11/lib/python3.11/site-packages/grakel/kernels/neighborhood_subgraph_pairwise_distance.py", line 308, in fit_transform
    self.fit(X)
  File "/home/jakub/.cache/pypoetry/virtualenvs/pesticide-bee-toxicity-prediction-Sj4YDJPR-py3.11/lib/python3.11/site-packages/grakel/kernels/kernel.py", line 124, in fit
    self.X = self.parse_input(X)
             ^^^^^^^^^^^^^^^^^^^
  File "/home/jakub/.cache/pypoetry/virtualenvs/pesticide-bee-toxicity-prediction-Sj4YDJPR-py3.11/lib/python3.11/site-packages/grakel/kernels/neighborhood_subgraph_pairwise_distance.py", line 138, in parse_input
    x.get_labels(purpose="adjacency", label_type="edge"))
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jakub/.cache/pypoetry/virtualenvs/pesticide-bee-toxicity-prediction-Sj4YDJPR-py3.11/lib/python3.11/site-packages/grakel/graph.py", line 750, in get_labels
    raise ValueError('Graph does not have any labels for edges.')
ValueError: Graph does not have any labels for edges.

My pipeline is:

kernel = NeighborhoodSubgraphPairwiseDistance(normalize=True)
svm = SVC(
    kernel="precomputed",
    probability=True,
    class_weight="balanced",
    cache_size=1024,
    random_state=0,
)
params_grid = {"svm__C": [1e-2, 1e-1, 1, 1e1, 1e2]}
pipeline = Pipeline([("kernel", kernel), ("svm", svm)])
grid_search = GridSearchCV(
    estimator=pipeline,
    param_grid=params_grid,
    scoring="roc_auc",
    cv=5,
    n_jobs=1,
)

graphs_train = smiles_to_grakel_graphs(smiles_train)
graphs_test = smiles_to_grakel_graphs(smiles_test)

grid_search.fit(graphs_train, y_train)  # appears as model.fit(...) in the traceback above

EDIT: interestingly, the labels do seem to be there initially: print(graphs_train[0].edge_labels) prints {(0, 1): 2, (1, 0): 2, (1, 2): 1, (1, 3): 1, (2, 1): 1, (3, 1): 1, (3, 4): 2, (3, 5): 1, (4, 3): 2, (5, 3): 1}. I also tried computing the kernel directly, without the pipeline, but I get the same error.
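
Inspecting only the first graph can miss a problem further into the dataset. Below is a minimal sketch of scanning every graph, assuming each graph's edge labels are available as a dict like the one printed above. The helper name find_graphs_without_edge_labels is hypothetical and not part of GraKeL.

```python
from typing import Mapping, Optional, Sequence, Tuple

# One entry per graph: a {(u, v): label} mapping, or None if nothing was set.
EdgeLabels = Optional[Mapping[Tuple[int, int], int]]


def find_graphs_without_edge_labels(all_edge_labels: Sequence[EdgeLabels]) -> list[int]:
    """Return indices of graphs whose edge-label mapping is missing or empty."""
    return [i for i, labels in enumerate(all_edge_labels) if not labels]


# Toy data (not from the issue): entries 1 and 3 mimic graphs without edge labels.
example = [
    {(0, 1): 2, (1, 0): 2},
    {},
    {(0, 1): 1, (1, 0): 1},
    None,
]
print(find_graphs_without_edge_labels(example))  # prints [1, 3]
```

With the graphs from this issue, the call would presumably look like find_graphs_without_edge_labels([g.edge_labels for g in graphs_train]), assuming edge_labels is readable on every graph the way it is for graphs_train[0].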

j-adamczyk commented 4 months ago

It turns out that I had single-atom molecules in my dataset, and that was the reason for the error. However, maybe the error message could be made more descriptive? Also, a graph with no edges, and therefore no edge labels, is a perfectly valid input in many cases, so I think it should be handled properly.
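
Until the library handles this case, one possible workaround is to drop edgeless graphs before fitting while keeping the targets aligned. The sketch below is library-agnostic: drop_edgeless and its count_edges argument are hypothetical names introduced here, not GraKeL API, and the toy "graphs" are plain lists of bonds.

```python
from typing import Callable, Sequence, TypeVar

G = TypeVar("G")
Y = TypeVar("Y")


def drop_edgeless(
    graphs: Sequence[G],
    targets: Sequence[Y],
    count_edges: Callable[[G], int],
) -> tuple[list[G], list[Y], list[int]]:
    """Split off graphs without edges, keeping graphs and targets aligned.

    Returns (kept_graphs, kept_targets, dropped_indices).
    """
    if len(graphs) != len(targets):
        raise ValueError("graphs and targets must have the same length")

    kept_graphs, kept_targets, dropped = [], [], []
    for i, (graph, target) in enumerate(zip(graphs, targets)):
        if count_edges(graph) > 0:
            kept_graphs.append(graph)
            kept_targets.append(target)
        else:
            # Remember the position so the caller can report or inspect it.
            dropped.append(i)
    return kept_graphs, kept_targets, dropped


# Toy stand-ins: each "graph" is just a list of (begin, end) bonds.
toy_graphs = [[(0, 1), (1, 2)], [], [(0, 1)]]
toy_targets = [1, 0, 1]
kept, y_kept, dropped = drop_edgeless(toy_graphs, toy_targets, count_edges=len)
print(dropped)  # prints [1]
```

For the networkx graphs built in smiles_to_grakel_graphs, count_edges could be lambda g: g.number_of_edges(); for RDKit molecules, lambda mol: mol.GetNumBonds(). Whether dropping such molecules is acceptable depends on the dataset and the task.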