Problems in reproducing subgraph retrieval

happen2me commented 1 year ago

Hi Zhen! I'm trying to reproduce your subgraph retrieval method for other datasets, but I encountered several problems.

1. I am confused about how you got the facts, i.e. SPO.txt. You mentioned CLOCQ in another issue. If I understood correctly, did you use CLOCQ to retrieve all facts of each grounded entity?
2. In the paper, an important step is to inject connectivity. In your implementation, however, (AFAIK) the steps for generating the connectivity file are not included. Did you also use CLOCQ's shortest path API to get the shortest path between each pair of question nodes?
3. A further question regarding point 2: it assumes that at least two entities are grounded from the question. So if only one entity or none is linked in the question, the method doesn't work at all, am I correct?

I'd really appreciate it if you could share how you made it. Thank you in advance :)

zhenjia2017 commented 1 year ago

Hi, I will publish the code for the problems you mentioned as soon as possible.

zhenjia2017 commented 1 year ago

1) I use ELQ + TagMe to get the NERD entities for each question, and then I use CLOCQ (https://github.com/PhilippChr/CLOCQ) to retrieve relevant facts for each question.
2) I use CLOCQ to check whether the NERD entities are connected within one hop or two hops. If they are connected, I use the "connect" function of CLOCQ to retrieve all paths, and then I use cosine similarity to choose the best paths for a question. (I will provide the code.)
3) If there is only one NERD entity, which cannot be grouped into a pair, then yes, there is no need to find the best path, because all facts share that one NERD entity and should always be connected.
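For readers following along, here is a minimal sketch of the pairwise path selection described in point 2. It is not the EXAQT code. `is_connected` and `find_paths` are hypothetical callables standing in for the CLOCQ connectivity check and "connect" calls, a path is assumed to be a list of label strings, and bag-of-words counts stand in for whatever text representation the cosine similarity is actually computed on.

```python
import math
import re
from collections import Counter
from itertools import combinations


def bag_of_words(text):
    """Lowercased token counts; an assumed stand-in for the real text encoder."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def best_paths(question, entities, is_connected, find_paths, top_k=1):
    """For every pair of linked entities, keep the top_k paths closest to the question.

    is_connected(e1, e2) -> bool and find_paths(e1, e2) -> list of paths are
    hypothetical callables supplied by the caller (assumptions, not CLOCQ's API).
    """
    q_vec = bag_of_words(question)
    selected = {}
    # With fewer than two entities there are no pairs, so nothing is selected.
    for e1, e2 in combinations(entities, 2):
        if not is_connected(e1, e2):
            continue
        ranked = sorted(
            find_paths(e1, e2),
            key=lambda path: cosine(q_vec, bag_of_words(" ".join(path))),
            reverse=True,
        )
        selected[(e1, e2)] = ranked[:top_k]
    return selected
```

With a single linked entity, `combinations` yields no pairs and the function returns an empty dict, which matches point 3.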

happen2me commented 1 year ago

So if I got you right, you used the GET /api/search_space API to retrieve the question-related facts, instead of searching for neighbors of grounded entities with GET /api/neighborhood? That would mean the entity linking step and the fact retrieval step are independent of each other.

If so, isn't it possible that the retrieved facts have nothing to do with the linked entities? This goes against the common approach, where facts are retrieved by searching for connections with the linked entities.

An imaginary example:

question: How old is Barack Obama's daughter
retrieved facts: [Old Henry - daughter - Anna]
linked entity: Barack_Obama

Also, for this example, the statement "there is no need to find the best path because all facts share one nerd entity and they should be always connected" does not hold, because the facts are not connected with the NERD entity.

zhenjia2017 commented 1 year ago

EXAQT does not use the search_space API to retrieve facts. It uses the neighborhood API to retrieve facts for the NERD entities.
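A rough sketch of what neighborhood-based retrieval looks like, not the EXAQT implementation. `get_neighborhood` is a hypothetical callable standing in for a neighborhood lookup, and the fact layout (a list of `{"id", "label"}` items ordered subject, predicate, object) is an assumption made for illustration. Because every fact comes from the neighborhood of a linked entity, each retrieved fact involves at least one of those entities.

```python
def facts_for_entities(entity_ids, get_neighborhood):
    """Collect deduplicated (subject, predicate, object) label triples.

    get_neighborhood(entity_id) is a hypothetical callable; it is assumed to
    return a list of facts, each a list of {"id": ..., "label": ...} items
    ordered subject, predicate, object. Items after the first three (for
    example qualifiers) are ignored in this sketch.
    """
    seen = set()
    triples = []
    for entity_id in entity_ids:
        for fact in get_neighborhood(entity_id):
            if len(fact) < 3:
                continue  # skip malformed facts
            triple = tuple(item["label"] for item in fact[:3])
            if triple not in seen:
                seen.add(triple)
                triples.append(triple)
    return triples
```

Each triple can then be written out as one line per fact, in whatever column layout the downstream SPO file expects.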

happen2me commented 1 year ago

Thank you, this solves my problem :)

zhenjia2017 commented 1 year ago

I added three scripts to the answer_graph folder: (1) "seed_path_extractor.py" for retrieving the best path between seed entities; (2) "get_CLOCQ_Wikidata_SPOs.py" for retrieving facts from CLOCQ and converting them to SPO format; (3) "get_fact_for_question" for generating spo.txt for each question in the benchmark.
