Closed: hengruizhang98 closed this issue 3 years ago
Hi, thank you for your interest.
In line 28 of data.py you can see that we invoke utils.create_masks(data=dataset.data) to create the train/val/test masks/splits. If you navigate to the create_masks function inside the utils.py module, on line 200 you can see that we first check whether the data already contains a validation mask (if not hasattr(data, "val_mask")). Since the citation network datasets have a val_mask attribute, we do not create a new one.
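Roughly, that check behaves like the following sketch (a paraphrase for illustration, assuming PyTorch Geometric style Data objects; the ratios and argument names are illustrative, not the repo's exact code):

```python
import torch

def create_masks(data, train_ratio=0.1, val_ratio=0.1):
    # Only build new masks when the dataset does not already ship with a
    # predefined validation mask (the Planetoid citation graphs carry
    # train/val/test masks, so this branch is skipped for them).
    if not hasattr(data, "val_mask"):
        num_nodes = data.num_nodes
        perm = torch.randperm(num_nodes)
        n_train = int(train_ratio * num_nodes)
        n_val = int(val_ratio * num_nodes)

        data.train_mask = torch.zeros(num_nodes, dtype=torch.bool)
        data.val_mask = torch.zeros(num_nodes, dtype=torch.bool)
        data.test_mask = torch.zeros(num_nodes, dtype=torch.bool)

        data.train_mask[perm[:n_train]] = True
        data.val_mask[perm[n_train:n_train + n_val]] = True
        data.test_mask[perm[n_train + n_val:]] = True
    return data
```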
Thanks for your response. To my knowledge, in the self-supervised setting (using the 'cora' dataset as an example), all the nodes (2,708) are used in the pretraining step. In the linear evaluation step, only the training nodes (140) are used to train the linear classifier, and the testing nodes (1,000) are used only for evaluation. However, it seems that you split the testing nodes into train/test sets with a 0.6/0.4 ratio (600 for training and 400 for testing).
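For reference, the standard public-split linear evaluation on Cora looks roughly like this (a minimal sketch assuming PyTorch Geometric's Planetoid loader; `encoder` stands in for a pretrained, frozen encoder and is hypothetical here):

```python
import torch
from sklearn.linear_model import LogisticRegression
from torch_geometric.datasets import Planetoid

data = Planetoid(root="data", name="Cora")[0]  # public split: 140 train / 500 val / 1000 test

with torch.no_grad():
    # Frozen embeddings for all 2708 nodes (hypothetical pretrained encoder).
    z = encoder(data.x, data.edge_index)

clf = LogisticRegression(max_iter=1000)
# Fit on the 140 public training nodes only.
clf.fit(z[data.train_mask].cpu().numpy(), data.y[data.train_mask].cpu().numpy())
# Evaluate on the 1000 public test nodes.
acc = clf.score(z[data.test_mask].cpu().numpy(), data.y[data.test_mask].cpu().numpy())
```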
Oh! I misunderstood your question. In that case you're right. We use a random (60/40) split of the test set for the LogisticRegression classifier.
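Continuing the sketch above (reusing `z` and `data`), the repo's evaluation amounts to something like this illustrative snippet, i.e. re-splitting the public test nodes 60/40 for the classifier:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Take only the 1000 public test nodes, then re-split them 60/40
# to fit and score the linear classifier (illustrative, not the exact repo code).
emb = z[data.test_mask].cpu().numpy()
labels = data.y[data.test_mask].cpu().numpy()

x_tr, x_te, y_tr, y_te = train_test_split(emb, labels, train_size=0.6, test_size=0.4, random_state=0)
acc = LogisticRegression(max_iter=1000).fit(x_tr, y_tr).score(x_te, y_te)
```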
Yes. So I guess you need to update your code and manuscript to compare fairly with other models.
BTW, a study discusses how using different splits can lead to significantly different outcomes. That is why we mention the split: to indicate which of the publicly available splits we used for the three citation datasets. However, I agree that we need to state this clearly in the manuscript, and I'll update it! Thank you for bringing this to light.
Hi, thanks for your nice work. I notice that in the original paper you state that you use the public split on the citation networks. However, in this repo it seems that you use a random split. Can you explain this?