working-yuhao / DEAL

IJCAI2020
MIT License
20 stars 6 forks

How to pre-process the original graph `.npz` data? #2

Closed gu18168 closed 3 years ago

gu18168 commented 3 years ago

Based on the filenames in the data directory, I used the following code to process the original citeseer.npz file (from https://github.com/abojchevski/graph2gauss):

import numpy as np
from pathlib import Path
import utils  # helper module from the graph2gauss / DEAL code

# Load the raw .npz file: adjacency (A), features (X), labels (z)
g = utils.load_dataset(str(Path('data', 'citeseer.npz')))
A, X, z = g['A'], g['X'], g['z']

utils.save_sp('data/', 'A', A)
utils.save_sp('data/', 'X', X)
np.save(Path('data', 'z.npy'), z)

# Split the edges into train/val/test positives and negatives
train_ones, val_ones, val_zeros, test_ones, test_zeros = \
    utils.train_val_test_split_adjacency(A, p_val=0.1, p_test=0.1)
np.savez(Path('data', 'data_arrays_link.npz'), train_ones,
         val_ones, val_zeros, test_ones, test_zeros)

But the resulting files differ in size from the ones you provided. Is there some other pre-processing step I need to do, or did I make a mistake somewhere?

gu18168 commented 3 years ago

I found that I may have used the wrong raw graph file; I can now generate the same A and X files. But how do I generate the corresponding data_arrays_link file? If I call the train_val_test_split_adjacency function directly, it fails at the assertion `assert not np.any(A.sum(0).A1 + A.sum(1).A1 == 0)  # no dangling nodes`, so I had to comment out that statement and pass every_node=False to run the function.
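Rather than commenting out the assertion, one possible workaround is to drop the dangling nodes before calling `train_val_test_split_adjacency`. This is just a sketch; the helper name `drop_dangling_nodes` is hypothetical and not part of the DEAL or G2G code:

```python
import numpy as np
import scipy.sparse as sp

def drop_dangling_nodes(A):
    """Remove nodes with no in- or out-edges so the split assertion holds.

    `A` is a scipy.sparse adjacency matrix. Returns the reduced matrix
    and the indices of the kept nodes (needed to remap labels/features).
    """
    A = sp.csr_matrix(A)
    degrees = A.sum(0).A1 + A.sum(1).A1  # total degree per node
    keep = np.where(degrees > 0)[0]      # indices of non-dangling nodes
    return A[keep][:, keep], keep

# Example: node 2 is dangling (no edges at all)
A = sp.csr_matrix(np.array([[0, 1, 0],
                            [1, 0, 0],
                            [0, 0, 0]]))
A_clean, keep = drop_dangling_nodes(A)
```

The returned `keep` array would then also be used to subset X and z so node indices stay consistent.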

However, processing the dataset this way greatly reduces the experimental results: for CiteSeer they drop to AUC: 0.7552, AP: 0.7946.

My current preprocessing code is:

import pickle
from pathlib import Path
import numpy as np
# get_tg_dataset, get_A, get_X, save_sp and train_val_test_split_adjacency
# are the helpers shipped with the DEAL repository

# Load CiteSeer through the repo's loader and cache the distance matrix
dataset = get_tg_dataset(args, 'CiteSeer', use_cache=False)

with open(Path(args.output, 'dists-1.dat'), 'wb') as f:
    pickle.dump(dataset[0].dists, f)

node_num = dataset[0].dists.size()[0]
attr_num = dataset[0].x.size()[1]

# Build an adjacency dict {row: [cols]} from the positive link mask
adj_dict = {}
for (row, col) in zip(dataset[0].mask_link_positive[0],
                      dataset[0].mask_link_positive[1]):
    if row in adj_dict:
        adj_dict[row].append(col)
    else:
        adj_dict[row] = [col]

A_new = get_A(adj_dict, node_num).tocsr().astype('float64')
save_sp(args.output + '/', 'A_new', A_new)

# Build an attribute dict {row: [non-zero feature cols]}
attr_dict = {}
for row, line in enumerate(dataset[0].x):
    for col, value in enumerate(line):
        if value != 0:
            if row in attr_dict:
                attr_dict[row].append(col)
            else:
                attr_dict[row] = [col]

X_new = get_X(attr_dict, node_num, attr_num).tocsr().astype('float32')
save_sp(args.output + '/', 'X_new', X_new)

# Split links; every_node=False avoids the dangling-node assertion
train_ones, val_ones, val_zeros, test_ones, test_zeros = \
    train_val_test_split_adjacency(A_new, p_val=0.1, p_test=0.1, every_node=False)
np.savez(Path(args.output, 'new_data_arrays_link.npz'), train_ones.astype('int64'),
         val_ones.astype('int64'), val_zeros.astype('int64'),
         test_ones.astype('int64'), test_zeros.astype('int64'))

working-yuhao commented 3 years ago

Hi, I cannot find the preprocessing file right now because this was work from about 2 years ago, and most of my current work uses edge lists directly, like PyTorch Geometric does. I will try to find the file, but I am not sure whether I deleted it while cleaning the server.

Also, using another preprocessing method doesn't matter, because the input files and matrices are really simple. Take A_sp.npz as an example: it simply contains three sparse matrices, the adjacency matrix (A), the feature matrix (X), and the label information (z) (the labels are not used in this work). Moreover, it is very easy to convert an edge list into an adjacency matrix if you use another dataset.
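That edge-list-to-adjacency conversion can be sketched as follows (the edge list and node count here are made up for illustration; the two-row `edge_index` layout follows the PyTorch Geometric convention):

```python
import numpy as np
import scipy.sparse as sp

# Hypothetical edge list: two rows of node indices (source, target)
edge_index = np.array([[0, 1, 1, 2],
                       [1, 0, 2, 1]])
num_nodes = 3

# Build a CSR adjacency matrix directly from the (row, col) pairs
A = sp.csr_matrix(
    (np.ones(edge_index.shape[1]), (edge_index[0], edge_index[1])),
    shape=(num_nodes, num_nodes),
)
```

Saving it afterwards with the repo's save_sp (or scipy.sparse.save_npz) would produce an input file in the same spirit as A_sp.npz.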

As for your code, I think the problem is that the input node features are incorrect if you use get_X() from utils.py; that function is actually for another extension work. I'm sorry I haven't had time to clean up the code. You can use the node features provided by the original dataset directly. I will try to add more comments and clean up the code in the future. Hope this helps.

Cheers, Yu

gu18168 commented 3 years ago

Thank you very much for your reply. After comparison, the preprocessing file I wrote gives me the same A and X as the files you provide. The main cause of the drop in results is that my val_zeros and test_zeros differ somewhat from the provided ones.

The way the train_val_test_split_adjacency function generates the zeros sets is different from the G2G version.

In DEAL, the zeros are sampled among node pairs reachable from each other within three hops; I think there may be a mistake here. After I replaced this code with the original G2G code, the results returned to normal.
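For contrast, G2G-style negative sampling draws zero pairs uniformly over all non-edges rather than restricting them to nearby nodes. A minimal sketch of that idea (the function name `sample_zeros` is hypothetical, not taken from either codebase):

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)

def sample_zeros(A, n_samples):
    """Sample node pairs that are NOT edges, uniformly over all pairs.

    A sketch of G2G-style uniform negative sampling; the restricted
    multi-hop sampling discussed above would instead limit (i, j) to
    pairs within a few hops of each other.
    """
    A = sp.csr_matrix(A)
    n = A.shape[0]
    zeros = []
    while len(zeros) < n_samples:
        i, j = rng.integers(0, n, size=2)
        if i != j and A[i, j] == 0:
            zeros.append((i, j))
    return np.array(zeros)

A = sp.csr_matrix(np.array([[0, 1, 0, 0],
                            [1, 0, 1, 0],
                            [0, 1, 0, 0],
                            [0, 0, 0, 0]]))
zeros = sample_zeros(A, 3)
```

Because uniform sampling mostly yields distant pairs, the resulting val/test zeros are much easier to separate from true edges, which would explain the AUC/AP gap.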

working-yuhao commented 3 years ago

Hi Nick,

Glad to hear the code works, and thanks for the information. The train_val_test_split_adjacency function in the released version may have been changed for the extension work I mentioned before.

Cheers, Yu

FatemeMirzaeii commented 1 year ago

@gu18168 Hi Nick! Can you help me with preprocessing my own graph to use with this model? Any help would be appreciated.

gu18168 commented 1 year ago

@FatemeMirzaeii Unfortunately, I can't find the code I had at the time, because it has been quite a while. I hope you can find clues to the processing in the comments above.

FatemeMirzaeii commented 1 year ago

> @FatemeMirzaeii Unfortunately, I can't find the code I had at the time because it's been a bit long. I hope you can find a clue to the processing in the comment.

Thanks for your quick reply! I don't need the exact code, since it will be completely different in my case: I have a raw graph in networkx format. If you can remember and explain what the input files represent, I would be very grateful, for example the dists data, data_arrays_link, nodes_keep, ind_train, etc.