About dataset - Githubissues

Neoyanghc commented 3 years ago

Hi, I'm confused about how you create train20.txt and test20.txt. Are the nodes just randomly selected from the whole dataset? Thanks.

zhumeiqiBUPT commented 3 years ago

Hello, we randomly select the training nodes (for ACM, UAI, BlogCatalog, Coraml, Flickr, ...) while making sure that each class gets 20/40/60 labeled training nodes. We use the last 1000 nodes (of the shuffled index list) for testing. For Citeseer, we select the training nodes in order (without random shuffling) to keep the same setting as GCN. Here is the corresponding code:

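import numpy as np

np.random.seed(20)  # fixed random seed so the split is reproducible
label = np.loadtxt('acm.label', dtype=int)  # one integer class label per node
label_rate = 60  # labeled training nodes per class: 60, 40, or 20

n = 3025  # number of nodes (ACM)
class_num = 3  # number of classes (ACM)
train_num = class_num * label_rate  # total training nodes (not used below)

# Shuffle all node indices once; both splits are drawn from this order.
idx = list(range(n))
np.random.shuffle(idx)

# Walk the shuffled order and keep a node until its class has label_rate nodes.
idx_train = []
count = [0] * class_num
for node in idx:
    l = label[node]
    if count[l] < label_rate:
        idx_train.append(node)
        count[l] += 1

# The last 1000 entries of the shuffled order form the test set.
idx_test = idx[-1000:]

np.savetxt('test' + str(int(label_rate)) + '.txt', idx_test, fmt='%d')
np.savetxt('train' + str(int(label_rate)) + '.txt', idx_train, fmt='%d')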
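
The Citeseer case mentioned above (training nodes taken in order, without shuffling) is not covered by the script, so here is a minimal sketch of how both variants could share one function. The function name `split_indices`, its parameters, and the synthetic labels are assumptions made for illustration and are not part of the repository. It also uses its own `RandomState`, so it is not guaranteed to reproduce the exact train/test files shipped with the repo.

```python
import numpy as np


def split_indices(label, label_rate, test_size=1000, shuffle=True, seed=20):
    """Hypothetical helper: pick `label_rate` training nodes per class.

    With shuffle=True the nodes are visited in random order (ACM-style);
    with shuffle=False they are visited in index order (Citeseer-style).
    The last `test_size` entries of the visiting order become the test set.
    """
    rng = np.random.RandomState(seed)
    idx = np.arange(len(label))
    if shuffle:
        rng.shuffle(idx)

    class_num = int(label.max()) + 1  # assumes labels are 0..class_num-1
    count = [0] * class_num
    idx_train = []
    for node in idx:
        c = label[node]
        if count[c] < label_rate:
            idx_train.append(int(node))
            count[c] += 1

    idx_test = [int(i) for i in idx[-test_size:]]
    return idx_train, idx_test


# Synthetic labels stand in for a real label file such as 'acm.label'.
label = np.random.RandomState(0).randint(0, 3, size=3025)

train_random, test_random = split_indices(label, 20, shuffle=True)
train_ordered, test_ordered = split_indices(label, 20, shuffle=False)
```

Note that nothing in this procedure explicitly excludes training nodes from the test set, so if a class is rare enough that its quota is only filled near the end of the order, it may be worth checking the two lists for overlap before saving them.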

Neoyanghc commented 3 years ago

Thanks

zhumeiqiBUPT / AM-GCN

About dataset #11