yifan-h / CS-GNN

Measuring and Improving the Use of Graph Information in Graph Neural Networks
MIT License

Sorry to bother you again #4

Closed junkangwu closed 3 years ago

junkangwu commented 4 years ago

In my recent study, I'm unsure how to calculate $\lambda_l$. In other words, do you use test-set nodes in your calculation of label smoothness? Consider this situation in the training set: the two endpoints of an edge belong to the training set and the test set respectively. Is that edge then ignored automatically?

junkangwu commented 4 years ago

And what is the difference between <train_prefix>-feats.npy and <train_prefix>-feats_t.npy? May I also ask how to understand the meaning of node topology features?

yifan-h commented 4 years ago

Hi there, I think you've run into a common problem in graph-based machine learning. Let me answer your questions first. For the smoothness calculations, the results we reported are derived from the entire graph, including training, validation, and testing nodes. So edges that connect a training node to a testing node are not ignored.

We do it that way for a reason. The datasets/graphs currently used to evaluate GNNs are fairly small, so if you calculate label smoothness based only on training nodes, the result will depend heavily on your data split. If you change your split strategy, or resplit the data randomly, the results can be very different, especially for semi-supervised tasks. Besides, the smoothness values we reported are mainly used to reveal the relationship between GNN performance and graph features. If you want to use them in your model, you can sample some nodes (from the training set) to estimate the label smoothness, as mentioned in the paper.
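To make the definition concrete, here is a minimal sketch of the edge-based label smoothness computation described above (the fraction of edges whose endpoints carry different labels). The function name and signature are hypothetical, not the repository's actual API; the optional mask restricts the estimate to a node subset such as the training set, as suggested for semi-supervised tasks.

```python
import numpy as np

def label_smoothness(edges, labels, node_mask=None):
    """Estimate lambda_l as the fraction of edges whose two endpoints
    have different labels (hypothetical helper, not the repo's code).

    edges     : (E, 2) int array of node-index pairs
    labels    : (N,) int array of class labels
    node_mask : optional (N,) bool array; if given, only edges with
                BOTH endpoints inside the mask (e.g. the training set)
                are counted, avoiding test-label leakage.
    """
    edges = np.asarray(edges)
    if node_mask is not None:
        keep = node_mask[edges[:, 0]] & node_mask[edges[:, 1]]
        edges = edges[keep]
    if len(edges) == 0:
        return 0.0
    differs = labels[edges[:, 0]] != labels[edges[:, 1]]
    return float(differs.mean())
```

Computed over all edges this gives the "whole graph" value reported in the paper; passing a training mask gives the split-dependent estimate discussed above.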

As for the files, <train_prefix>-feats.npy contains the original node features, and <train_prefix>-feats_t.npy contains the topology features of nodes generated by the GraphWave method, which help the model differentiate nodes with similar original features but dissimilar local topologies.

junkangwu commented 4 years ago

Thanks for your detailed explanation! I now see where my thinking went wrong. So in your model, label smoothness is used to remove some neighbors that carry negative information, and by its definition it measures the proportion of negative information in the whole graph. By the way, I'm considering one extreme case: suppose node a is connected to node b and node c. Nodes a and b belong to the same class and are both in the training set, while node c belongs to another class and is in the test set. If the edge between a and c is not removed in the first step of your model, then c will be pulled closer to a under your attention coefficient, which is wrong. Of course, this may just be an extreme case. Thanks a lot, from a junior student in China.

yifan-h commented 4 years ago

Great! I'm glad that it helps. Data splitting on graphs is complex, and the current immature split methods trouble many researchers in related fields.

Since there is no more question, I'll close the issue now.

liu-jc commented 4 years ago

> Hi there, I think you've run into a common problem in graph-based machine learning. Let me answer your questions first. For the smoothness calculations, the results we reported are derived from the entire graph, including training, validation, and testing nodes. So edges that connect a training node to a testing node are not ignored.
>
> We do it that way for a reason. The datasets/graphs currently used to evaluate GNNs are fairly small, so if you calculate label smoothness based only on training nodes, the result will depend heavily on your data split. If you change your split strategy, or resplit the data randomly, the results can be very different, especially for semi-supervised tasks. Besides, the smoothness values we reported are mainly used to reveal the relationship between GNN performance and graph features. If you want to use them in your model, you can sample some nodes (from the training set) to estimate the label smoothness, as mentioned in the paper.
>
> As for the files, <train_prefix>-feats.npy contains the original node features, and <train_prefix>-feats_t.npy contains the topology features of nodes generated by the GraphWave method, which help the model differentiate nodes with similar original features but dissimilar local topologies.

I'd like to confirm: do you need the entire graph (including the test set) to calculate the label smoothness? And do you then use this label smoothness value to train the model? Using the entire graph to calculate the feature smoothness would be totally fine, but for the label smoothness, wouldn't it lead to data leakage?

yifan-h commented 4 years ago

Hi,

As mentioned in the paper, we can only use part of the data (the training set) to get a rough estimate of the label smoothness. Theoretically, we cannot obtain the ground-truth value of label smoothness when the task is node classification; otherwise it would cause data leakage, as you mentioned.

But here we report the value based on the whole graph. The reason is that the two measurements are not only for the node classification task; we simply want readers to see the exact values, to judge the effectiveness of the measurements. If you want to use these measurements in your model and your task is similar to node classification, the right way is to use the estimated value.

liu-jc commented 4 years ago

Thank you for your reply! Now I understand. Could you provide the estimated values you used in your experiments? And could you share some insights on how to estimate them? I think that using only the label smoothness of the training set as the estimate would not give a good result: in the semi-supervised setting the training set is quite small, so the value on the training set could differ considerably from that on the whole graph.

yifan-h commented 4 years ago

You may refer to ./smoothness for the code used to compute the two smoothness metrics.

If the task setting is semi-supervised, the metric is a bit difficult to use, since the distribution of the labeled subgraph may be very different from that of the unknown whole graph. Perhaps you can use a label propagation algorithm to annotate some nodes, then estimate the label smoothness on the larger labeled subgraph.
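The label-propagation idea above can be sketched as follows. This is a minimal, assumed implementation (majority vote over already-labeled neighbors, iterated until no unlabeled node has a labeled neighbor), not the algorithm from the repository; function and variable names are hypothetical.

```python
import numpy as np

def propagate_labels(adj, labels, labeled_mask, n_iter=10):
    """Annotate unlabeled nodes by repeated neighbor majority vote
    (hypothetical sketch, not the repo's code).

    adj          : (N, N) 0/1 adjacency matrix (float)
    labels       : (N,) int labels, valid only where labeled_mask is True
    labeled_mask : (N,) bool array marking initially labeled nodes

    Returns the enlarged label array and mask; label smoothness can
    then be estimated on edges inside the enlarged labeled subgraph.
    """
    labels = labels.copy()
    labeled = labeled_mask.copy()
    n_classes = labels[labeled].max() + 1
    for _ in range(n_iter):
        # votes[i, c] = number of labeled neighbors of node i in class c
        onehot = np.zeros((len(labels), n_classes))
        onehot[labeled, labels[labeled]] = 1.0
        votes = adj @ onehot
        has_vote = (~labeled) & (votes.sum(axis=1) > 0)
        if not has_vote.any():
            break
        labels[has_vote] = votes[has_vote].argmax(axis=1)
        labeled |= has_vote
    return labels, labeled
```

The fraction of differently-labeled edges within the enlarged subgraph then serves as the leakage-free estimate of $\lambda_l$ discussed in this thread.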