zekarias-tilahun / SelfGNN

A PyTorch implementation of the paper "SelfGNN: Self-supervised Graph Neural Networks without explicit negative sampling", which appeared in the International Workshop on Self-Supervised Learning for the Web (SSL'21) at the Web Conference 2021 (WWW'21).

Public split not used on citation networks #1

Closed: hengruizhang98 closed this issue 3 years ago

hengruizhang98 commented 3 years ago

Hi, thanks for your nice work. I find that in the original paper you state that you use the public split on the citation networks. However, in this repo it seems that you use a random split. Could you explain this?

zekarias-tilahun commented 3 years ago

Hi, thank you for your interest.

On line 28 of data.py you can see that we invoke utils.create_masks(data=dataset.data) to create the train/val/test masks/splits. If you navigate to the create_masks function inside the utils.py module, on line 200 you will find that we first check whether the data already contains a validation mask (if not hasattr(data, "val_mask")). Since the citation network datasets come with a val_mask attribute, we do not create a new one.
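
In rough terms, the logic looks like the sketch below. This assumes a PyG-style `Data` object; the random-split branch and its ratios are illustrative rather than copied verbatim from utils.py.

```python
import torch


def create_masks(data, train_ratio=0.6):
    """Sketch: only build new train/val/test masks when the dataset
    does not already ship with a validation mask."""
    if not hasattr(data, "val_mask"):
        # Hypothetical random split; the ratios here are placeholders.
        num_nodes = data.num_nodes
        perm = torch.randperm(num_nodes)
        train_size = int(train_ratio * num_nodes)
        val_size = (num_nodes - train_size) // 2

        train_mask = torch.zeros(num_nodes, dtype=torch.bool)
        val_mask = torch.zeros(num_nodes, dtype=torch.bool)
        test_mask = torch.zeros(num_nodes, dtype=torch.bool)

        train_mask[perm[:train_size]] = True
        val_mask[perm[train_size:train_size + val_size]] = True
        test_mask[perm[train_size + val_size:]] = True

        data.train_mask, data.val_mask, data.test_mask = train_mask, val_mask, test_mask
    # Citation networks (Cora/Citeseer/Pubmed) already carry the public
    # train/val/test masks, so the branch above is skipped and they are kept as-is.
    return data
```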

hengruizhang98 commented 3 years ago

Thanks for your response. To my knowledge, in the self-supervised setting (taking the 'cora' dataset as an example), all 2708 nodes are used during pre-training. In the linear evaluation step, only the 140 training nodes should be used to train the linear classifier, and the 1000 test nodes should be used only for evaluation. However, it seems that you split the test nodes into train/test sets with a 0.6/0.4 ratio (600 for training and 400 for testing).
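
For concreteness, a minimal sketch of that standard public-split linear evaluation, assuming `embeddings` are the frozen node representations produced by the self-supervised encoder and `data` is a PyG-style Cora object with the public masks (the function name is illustrative):

```python
from sklearn.linear_model import LogisticRegression


def evaluate_public_split(embeddings, data):
    X = embeddings.detach().cpu().numpy()
    y = data.y.cpu().numpy()
    train_idx = data.train_mask.cpu().numpy()   # 140 labelled nodes on Cora
    test_idx = data.test_mask.cpu().numpy()     # 1000 held-out test nodes

    clf = LogisticRegression(max_iter=1000)
    clf.fit(X[train_idx], y[train_idx])         # linear probe on the train split only
    return clf.score(X[test_idx], y[test_idx])  # accuracy on the full test split
```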

zekarias-tilahun commented 3 years ago

Oh! I misunderstood your question. In that case you're right. We use a random (60/40) split of the test set for the LogisticRegression classifier.
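
Roughly, that evaluation looks like the following sketch, where the public test nodes are re-split 60/40 and the classifier is fit on the 60% portion (variable and function names are illustrative, not taken from the repo):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def evaluate_random_split(embeddings, data, seed=0):
    X = embeddings.detach().cpu().numpy()
    y = data.y.cpu().numpy()
    test_idx = np.where(data.test_mask.cpu().numpy())[0]   # 1000 nodes on Cora

    fit_idx, eval_idx = train_test_split(
        test_idx, train_size=0.6, random_state=seed)        # 600 / 400 split

    clf = LogisticRegression(max_iter=1000)
    clf.fit(X[fit_idx], y[fit_idx])
    return clf.score(X[eval_idx], y[eval_idx])
```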

hengruizhang98 commented 3 years ago

Yes. So I guess you will have to update your code and the manuscript in order to compare fairly with other models.

zekarias-tilahun commented 3 years ago

BTW, a study discusses how using different splits can result in significantly different outcomes. Thus, we mention the split to indicate which of the publicly available splits we used for the three citation datasets. However, I agree that we need to state this clearly in the manuscript, and I'll update it! Thank you for bringing this to light.