Knowledge Graph as Input: question-specific subgraphs

cmavro commented 2 years ago

Hi, very exciting work!

I have a question on how you create the question-specific subgraphs when using Knowledge Graphs as input (i.e., ComplexWebQ). By navigating in compwebq/test.jsonl, I see that the maximum number of triplets used over all questions is 61 and that at least one answer lies within the subgraphs in 2725/2816 (96.8%) test questions.

Do you use specific mechanisms to prune irrelevant facts and how you make sure to contain the answers?

Thanks a lot!

Timothyxxx commented 2 years ago

Hi Costas,

Thanks for double-checking! The way we get a KG subgraph for each question is:

First we get the gold entities in the question and get their 2 hop neighbors on the KG to get the initial KG subgraph
We then use the gold SPARQL query to identify the KG triples necessary for reaching the answer entity, and keep those triples (so that the question is answerable).
We then sample additional KG triples from the initial KG subgraph and keep them until we reach 60 triples in total (because 60 triples can roughly fit in the language model context length like T5) Those 60 triples (necessary ones + sampled ones) will be the final input KG for the model.

Hope this information is helpful!

Thanks

cmavro commented 2 years ago

Thanks for the clarification!

cdhx commented 2 years ago

For the question that we do not know answers(e.g. GrailQA test set), how to sample the subgraph?

And where can I find the sample code?

Thx

Timothyxxx commented 2 years ago

Hi, since the GrailQA has hidden test set, our test set is the split of dev set.

For WebQSP, as the original dataset does not have dev set, we split the original train set into in-house train/dev sets (90%/10%), following prior practice (e.g. Ren et al. (2021)). Similarly, for CompWebQ, as the test set is not publicly avail- able, we split the original dev set into in-house dev/test sets (20%/80%). For GrailQA, we split the original dev set into in-house dev/test sets (5%/95%).

Information above will be added in the next version of arxiv we are preparing. And sorry for the confusion in the last version. Hope this information helpful!

Thanks

cdhx commented 2 years ago

Thanks for your replay, it is helpful

I also wonder how to sample the subgraph (the 60 triples), so I can quickly test a question in dataset. colab has not provide knowledge based question answering usage.

Besides, where can I find the jsonl file mentioned by @cmavro

thanks a lot!

Timothyxxx commented 2 years ago

Thanks! I think I see where the problem lies. Sorry for confusing and allow me to explain.

TLDR;

1, Check out the data we processed for you in here to test your question, just search for your question and combine the text sequence and structured knowledge sequence together as an input that in colab will be fine! 2, The jsonl lies in here, we use the huggingface to download and prepare it in this segment of code.

Longer version of more details😃 The logic of the UnifiedSKG framework is: 1) Downloading and read-in the raw data from their source by scripts in ./tasks file. 2) Convert them into the seq2seq version by the scripts in ./seq2seq_construction file. 3) Run experiments from the models in ./models file through train.py and control the procedure by configs in ./configure file and args inputed through command line. We trained the weight through the UnifiedSKG framework and upload the trained weight for usage.

However, considering that not everyone want to go through all this framework and we want to attract readers by easy usage, we provide a usage demo in colab which actually simplifies the data loading procedure, allowing user to input what they like instead of from some dataset for play. So it is actually different in coding logic(a.k.a One is for developers and one is for audiences).

Hope this information is helpful!

Thanks

sythello commented 1 year ago

Hello, I was wondering about the same questions that people were asking here and thank you so much for the detailed answers! I have a follow-up question - In the subgraph extraction step, it seems that the gold SPARQL query is needed even during test time, which is a bit unusual. In this way, is it directly comparable to other methods?

xlang-ai / UnifiedSKG

Knowledge Graph as Input: question-specific subgraphs #9