What is the format of SPO.txt in data/files/ques_*/ directory?

apoorvumang commented 2 years ago

Hi, thanks for the interesting work

I'm assuming SPO.txt contains a processed subgraph around the topic entities present in a question. What format is it stored in? It doesn't seem to be any 'triple' format.

For example, data/files/ques_100/SPO.txt contains lines like

Q30-ce1df58f8415808bf89804171507f9b0-pq:||corner#Q30#0.9792777100286615#nation||United States||P7295||P1365||Q11184||Julian calendar

How should I interpret these, and can I get back the WikiData subgraph corresponding to these facts?

Thanks

zhenjia2017 commented 2 years ago

Thanks a lot for your interest.

Yes, SPO.txt contains the subgraph around the topic entities, including all one-hop facts of the topic entities. The one-hop facts contain direct facts and qualifier facts.

For each line in the file, "||" is the delimiter to split it into 7 parts.

The format of the first part is: "entity Wikidata id"-"hash"-"ps (or pq)". "hash" is the 32-character (128-bit) MD5 hash of each fact, which can be used as the intermediate node if the fact contains a qualifier (like a CVT in Freebase). "ps" indicates that the line is the main predicate triple. "pq" indicates that the line is a qualifier triple.
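As a rough illustration of using the hash as an intermediate node, the sketch below groups raw SPO.txt lines by their (entity id, hash) pair so that a main triple and its qualifier triples end up together. The regular expression is inferred only from the example line in this thread (including the trailing ":"), and the function name `group_by_statement` is hypothetical, not part of the EXAQT code. It also assumes that the "ps" and "pq" lines of one fact share the same entity id and hash.

```python
import re
from collections import defaultdict

# Pattern for the first field, inferred from the example line in this thread:
# "<entity id>-<32 hex chars>-<ps|pq>", optionally followed by ":".
STATEMENT_ID = re.compile(
    r"^(?P<entity>.+)-(?P<hash>[0-9a-f]{32})-(?P<kind>ps|pq):?$"
)


def group_by_statement(lines):
    """Group raw SPO.txt lines by (entity id, hash), keeping ps and pq apart."""
    statements = defaultdict(lambda: {"ps": [], "pq": []})
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        first_field = line.split("||", 1)[0]
        match = STATEMENT_ID.match(first_field)
        if match is None:
            raise ValueError(f"unexpected statement id: {first_field!r}")
        key = (match["entity"], match["hash"])
        statements[key][match["kind"]].append(line)
    return dict(statements)


# The example line quoted earlier in this thread.
example = (
    "Q30-ce1df58f8415808bf89804171507f9b0-pq:"
    "||corner#Q30#0.9792777100286615#nation"
    "||United States||P7295||P1365||Q11184||Julian calendar"
)
grouped = group_by_statement([example])
```

Each key of `grouped` can then be treated as one statement node, with its "ps" line giving the main triple and its "pq" lines giving the qualifiers.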

The second and third parts are the subject. The fourth and fifth parts are the predicate. The sixth and seventh parts are the object of the triple.

The second part contains the subject entity's Wikidata id. If the entity is a cornerstone that will be used to compute GSTs, we also record its score and its matched text in the question, with "#" as the delimiter. The third part is the entity label in Wikidata. The format of the object is the same as that of the subject.

The fourth part is the main predicate and the fifth part is the qualifier predicate.
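The seven-part layout described above could be read with a small parser along the following lines. This is only a sketch: the names `Node`, `parse_node`, and `parse_spo_line` are hypothetical, and the cornerstone layout "corner#<id>#<score>#<matched text>" is an assumption taken from the single example line in this thread rather than from the EXAQT code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    value: str                          # Wikidata id, or a literal value
    label: str                          # label from the following field
    score: Optional[float] = None       # only set for cornerstone entities
    matched_text: Optional[str] = None  # question text matched by a cornerstone


def parse_node(raw: str, label: str) -> Node:
    # Assumption from the example line: cornerstones look like
    # "corner#<id>#<score>#<matched text>"; anything else is kept as-is.
    if raw.startswith("corner#"):
        _, value, score, matched = raw.split("#", 3)
        return Node(value, label, float(score), matched)
    return Node(raw, label)


def parse_spo_line(line: str) -> dict:
    parts = line.rstrip("\n").split("||")
    if len(parts) != 7:
        raise ValueError(f"expected 7 fields, got {len(parts)}")
    stmt, subj, subj_label, main_pred, qual_pred, obj, obj_label = parts
    stmt = stmt.rstrip(":")  # the example line ends its first field with ":"
    return {
        "statement_id": stmt,
        "is_qualifier": stmt.endswith("-pq"),
        "subject": parse_node(subj, subj_label),
        "main_predicate": main_pred,
        "qualifier_predicate": qual_pred,
        "object": parse_node(obj, obj_label),
    }


# The example line quoted earlier in this thread.
example = (
    "Q30-ce1df58f8415808bf89804171507f9b0-pq:"
    "||corner#Q30#0.9792777100286615#nation"
    "||United States||P7295||P1365||Q11184||Julian calendar"
)
fact = parse_spo_line(example)
```

For the example line, `fact["subject"]` is the cornerstone Q30 with matched text "nation", the main predicate is P7295, the qualifier predicate is P1365, and the object is Q11184 (Julian calendar).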

However, SPO.txt is not the final answer graph used to predict answers to a question. To get the answer graph, we select the facts relevant to the question from SPO.txt, compute the GST graph, and then enhance the GST graph with temporal facts. The GST graph and its enhanced temporal facts with ranking are stored in "data/temcompactgst".

If there is any issue about the data or code, please let me know.

apoorvumang commented 2 years ago

Reopening for a clarification.

Does data/temcompactgst contain the final GST including both temporal and non-temporal facts? If so, can we ignore data/compactgst if we are concerned with only the final graph on which stage 2 of the algorithm is applied?

zhenjia2017 commented 2 years ago

I am sorry to have confused you. It should be: "The GST graph is stored in data/compactgst and its enhanced temporal facts with ranking are stored in data/temcompactgst."

"data/compactgst" contains the final GST, which will be used to generate the final graph. "data/temcompactgst" contains the temporal facts of the entities in the final GST, which are also useful for generating the final graph on which stage 2 of the algorithm is applied. The final graph can be built without the temporal facts, but the performance on stage 2 will be lower.

apoorvumang commented 2 years ago

So for the example answer graph shown in the paper (Figure 1, regarding school where Obama's children went), the graph shown would be available in which file?

zhenjia2017 commented 2 years ago

The final answer graph is in "data/dictionaries/". The folder contains three JSON files: train_subgraph.json, dev_subgraph.json, and test_subgraph.json. The Python file /answer_predict/get_relational_graph.py is used to generate these JSON files. For each question, there is a dictionary, and some of its important keys are as follows:

    {
      "question": #question text,
      "corner_entities": #question cornerstone entities (or seed entities),
      "answers": #ground truth answers,
      "id": #question id in the benchmark,
      "subgraph": {
        "entities": ques_entities,
        "tuples": tuples #triples in the final answer graph
      },
      "signal": #question temporal signals,
      "type": #question temporal categories,
      "tkg": #question temporal fact triples,
      "tempentities": #entities in question temporal fact triples "tkg",
      "temprelations": #temporal relations (or predicates)
    }
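A quick way to sanity-check one of these per-question dictionaries might look like the sketch below. It only relies on the key names listed above; the function `summarize_record` is hypothetical, and the sample record is made up for illustration (its values, including the three-element tuple shape, do not come from the dataset). How the records are arranged at the top level of each JSON file is not stated here, so the sketch works on a single record.

```python
import json

# Keys listed in the reply above; "subgraph" holds "entities" and "tuples".
EXPECTED_KEYS = [
    "question", "corner_entities", "answers", "id", "subgraph",
    "signal", "type", "tkg", "tempentities", "temprelations",
]


def summarize_record(record: dict) -> dict:
    """Report which expected keys are missing and how large the graph is."""
    subgraph = record.get("subgraph", {})
    return {
        "id": record.get("id"),
        "missing_keys": [k for k in EXPECTED_KEYS if k not in record],
        "num_entities": len(subgraph.get("entities", [])),
        "num_tuples": len(subgraph.get("tuples", [])),
        "num_temporal_facts": len(record.get("tkg", [])),
    }


# Made-up record for illustration only; values do not come from the dataset.
raw = json.dumps({
    "question": "placeholder question",
    "id": 0,
    "corner_entities": [],
    "answers": [],
    "subgraph": {"entities": ["e1", "e2"], "tuples": [["e1", "p", "e2"]]},
    "tkg": [],
})
record = json.loads(raw)
summary = summarize_record(record)
```

Running `summarize_record` over the records of a real file would show whether every question has a non-empty "tuples" list before the graph is passed on to stage 2.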

zhenjia2017 / EXAQT
