yao8839836 / text_gcn

Graph Convolutional Networks for Text Classification. AAAI 2019
1.35k stars 434 forks

Question related to preprocessing data script #58

Open Abhinav43 opened 4 years ago

Abhinav43 commented 4 years ago

Hi Yao,

I am using some graph network models to get embeddings, and for that I am using your script to generate the graph files. However, some of these models require one extra file called 'ind.{file_name}.graph', which is not generated by your script.

So there are only 7 files:

ind.wiki.x, ind.wiki.y, ind.wiki.allx, ind.wiki.ally, ind.wiki.adj, ind.wiki.tx, ind.wiki.ty

I am trying to generate the ind.wiki.graph file, which is

graph, a dict in the format {index: [index_of_neighbor_nodes]}, where the neighbor nodes are organised as a list.

as mentioned in this repository https://github.com/kimiyoung/planetoid

Can you please help me generate ind.wiki.graph, or let me know how to generate it?

Thanks in advance !

yao8839836 commented 4 years ago

@Abhinav43

In build_graph.py, elements in "row" will be "index" in your dict, and elements in "col" will be "index_of_neighbor_nodes" in your dict. Please see lines 449-450 and lines 492-496 in build_graph.py.

You may also want to check that the adj matrix is symmetric, i.e., if j is a neighbor node of i, then i is also a neighbor node of j.
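The conversion described above can be sketched as follows. This is a minimal sketch, assuming "row" and "col" are plain Python lists of node indices as built in build_graph.py; the symmetrization follows the note about the adj matrix:

```python
from collections import defaultdict

def rows_cols_to_graph(row, col):
    """Convert parallel edge lists ("row" -> "col") into the planetoid
    graph format: {index: [index_of_neighbor_nodes]}.
    Adds both directions so the resulting graph is symmetric."""
    graph = defaultdict(list)
    for i, j in zip(row, col):
        if j not in graph[i]:
            graph[i].append(j)
        if i not in graph[j]:
            graph[j].append(i)
    return dict(graph)

# Toy example: edges (0,1), (0,2), (1,2)
row = [0, 0, 1]
col = [1, 2, 2]
print(rows_cols_to_graph(row, col))  # {0: [1, 2], 1: [0, 2], 2: [0, 1]}
```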

Abhinav43 commented 4 years ago

@yao8839836 , thank you for the quick reply.

Which "row" and "col" should I use for the graph, lines 449-450 or lines 492-496? I have two questions:

1. Can I generate the graph from the adj matrix, with row as index and col as neighbors?
2. If I have edges.txt and labels.txt, how can I build all those files? I mean, how can I put them into the input format your script expects?

Thank you !

yao8839836 commented 4 years ago

@Abhinav43

Lines 449-450 and lines 492-496 refer to the same "row" and "col".

"Can I generate graph from adj matrix? row as index and col as neighbours?"

Yes, you can, but it would be easier to generate the graph from "row" and "col" instead of "adj".
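For completeness, the neighbor dict can also be recovered from a scipy sparse adj, since the COO form gives back "row" and "col" directly. The matrix below is a toy example, not the one produced by build_graph.py:

```python
from collections import defaultdict

import scipy.sparse as sp

# Toy adjacency matrix with edges 0->1, 0->2, 1->0, 1->2
adj = sp.coo_matrix(([1, 1, 1, 1], ([0, 0, 1, 1], [1, 2, 0, 2])), shape=(3, 3))

# adj.row / adj.col are exactly the "row" and "col" arrays
graph = defaultdict(list)
for i, j in zip(adj.row, adj.col):
    graph[int(i)].append(int(j))
print(dict(graph))  # {0: [1, 2], 1: [0, 2]}
```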

"If i have edges.txt, labels.txt how can I build all those files? I mean how can i give that input format to your script?"

With edges.txt, you know the node pairs, e.g., (0, 1), (0, 2), (1, 2). Then you add [0, 0, 1] to "row", [1, 2, 2] to "col" and [1, 1, 1] to "weight" in build_graph.py. With labels.txt, you know the node labels; you can convert them to one-hot vectors and add them to ind.wiki.y, ind.wiki.ally and ind.wiki.ty. See the example at lines 357-371 in build_graph.py.
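The steps above can be sketched as follows. The edges.txt / labels.txt formats shown here are assumptions (one "src dst" pair per line, one label per node in node order), not something the repository specifies:

```python
import numpy as np

# Hypothetical file contents, inlined so the sketch is self-contained
edges_txt = "0 1\n0 2\n1 2\n"    # node pairs (0,1), (0,2), (1,2)
labels_txt = "a\nb\na\n"          # one label per node

# Build "row", "col" and "weight" as in build_graph.py
row, col, weight = [], [], []
for line in edges_txt.splitlines():
    i, j = map(int, line.split())
    row.append(i)
    col.append(j)
    weight.append(1)

# Convert labels to one-hot vectors
labels = labels_txt.split()
label_set = sorted(set(labels))               # ['a', 'b']
one_hot = np.zeros((len(labels), len(label_set)), dtype=int)
for n, lab in enumerate(labels):
    one_hot[n, label_set.index(lab)] = 1

print(row, col, weight)   # [0, 0, 1] [1, 2, 2] [1, 1, 1]
print(one_hot.tolist())   # [[1, 0], [0, 1], [1, 0]]
```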

Abhinav43 commented 4 years ago

@yao8839836

Thank you for the quick response.

So I have added the node pairs to 'row' (line 430) and 'col' (line 431) as you said. I have also converted labels.txt into one-hot vectors, but instead of using the vocab I used the total number of labels to create the one-hot encoding.

For example, if I have five labels a, b, c, d, e and one sentence has the three labels b, c, d, then the one-hot vector would be [0, 1, 1, 1, 0].
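Written out, that multi-label encoding is a one-liner (the label set is hard-coded here purely for illustration):

```python
# Five labels a..e; a sentence labelled {b, c, d} becomes [0, 1, 1, 1, 0]
all_labels = ['a', 'b', 'c', 'd', 'e']
sentence_labels = {'b', 'c', 'd'}
vec = [1 if lab in sentence_labels else 0 for lab in all_labels]
print(vec)  # [0, 1, 1, 1, 0]
```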

Now I can build the adj matrix (ind.wiki.adj) and the graph (ind.wiki.graph), but how do I build ind.wiki.x, ind.wiki.y, ind.wiki.allx, ind.wiki.ally, ind.wiki.tx and ind.wiki.ty?

I am really confused; it would be a great help if you could share a draft script for these files.

Thank you in advance! Waiting for your reply.

yao8839836 commented 4 years ago

@Abhinav43

ind.wiki.x contains the feature vectors of the training nodes. The order of the nodes is the same as in ind.wiki.adj. ind.wiki.y is the label matrix of ind.wiki.x, i.e., a list of vectors like [0, 1, 1, 1, 0].

ind.wiki.allx contains the feature vectors of all nodes except the test nodes. ind.wiki.ally is the label matrix of ind.wiki.allx.

ind.wiki.tx contains the feature vectors of the test nodes. ind.wiki.ty is the label matrix of ind.wiki.tx.
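Putting those definitions together, here is a toy sketch of slicing the six objects out of a full feature/label matrix. It assumes identity features, a node ordering matching ind.wiki.adj with test nodes last, and made-up sizes; pickle is one way to serialize them in the planetoid layout:

```python
import pickle

import numpy as np
import scipy.sparse as sp

N, n_train, n_test = 10, 5, 4            # toy sizes: total / training / test nodes
features = sp.identity(N).tocsr()        # N x N identity feature matrix
labels = np.eye(2)[[0, 1] * 5]           # toy N x C one-hot label matrix

x, y = features[:n_train], labels[:n_train]          # ind.wiki.x  / ind.wiki.y
tx, ty = features[-n_test:], labels[-n_test:]        # ind.wiki.tx / ind.wiki.ty
allx, ally = features[:-n_test], labels[:-n_test]    # ind.wiki.allx / ind.wiki.ally

for name, obj in [('x', x), ('y', y), ('tx', tx), ('ty', ty),
                  ('allx', allx), ('ally', ally)]:
    with open('ind.wiki.' + name, 'wb') as f:
        pickle.dump(obj, f)
```

Note that allx is larger than x whenever there are unlabeled or validation nodes between the training and test blocks.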

Abhinav43 commented 4 years ago

@yao8839836

Sorry for the trivial question: which nodes did you use for training/validation/testing? I'm trying to build the three datasets (and the respective masks) and I'm not sure about the steps. So far I have understood the following:

adj is the adjacency matrix of the graph, with all the nodes (it doesn't matter whether they belong to the train/validation/test set). Therefore its dimension is NxN, where N is the total number of nodes across the three sets.

features is an NxD matrix, which in this case is an NxN identity matrix (again, with N = total number of nodes).

y_train, y_val and y_test are NxC matrices (C = number of categories).

But I am still confused about how you split them into train/test/val.

It would be really helpful if you could write a simple script to clear up these doubts, if you have time. If you provide the script, I will preprocess the karate dataset and some others with it.

Here is the dataset https://github.com/shenweichen/GraphEmbedding/tree/master/data/wiki

Thank you again, and sorry for making trivial questions.

yao8839836 commented 4 years ago

@Abhinav43

You are right about adj, features, y_train, y_val and y_test.

I am splitting the nodes based on the standard splits of the benchmark text datasets.

If you don't have a standard split, you can split the nodes randomly. For example, in the original GCN paper, the first 140 nodes are training nodes (including validation nodes) and the others are test nodes.
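That split can be sketched with boolean masks. The sizes below follow the "first 140 nodes train, rest test" example; the labels are toy values, not a real dataset:

```python
import numpy as np

N, n_train = 2708, 140                    # total nodes / training nodes
train_mask = np.zeros(N, dtype=bool)
train_mask[:n_train] = True               # first 140 nodes train
test_mask = ~train_mask                   # the rest test

y = np.eye(7)[np.arange(N) % 7]           # toy N x C one-hot label matrix
y_train = y * train_mask[:, None]         # zero out labels of non-training nodes
y_test = y * test_mask[:, None]           # zero out labels of non-test nodes

print(int(train_mask.sum()), int(test_mask.sum()))  # 140 2568
```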