ZifengDing opened this issue 1 year ago
Another question: I noticed that you only report Hits@1. In your evaluation you check whether the ground-truth missing entity appears in the generated text. I am not sure whether this is over-optimistic, since the LLM sometimes hallucinates and the output may contain the ground-truth entity without actually treating it as the predicted answer (e.g., the LLM outputs "I am not sure which one is the answer, but I think the ground truth should be either {ground truth} or {another entity}").

Another problem is that it seems impossible to calculate Hits@3/10 and mean reciprocal rank (MRR) in your framework. It is not a big problem, but do you have any idea of how we could incorporate these metrics into your work? For example, this work (https://arxiv.org/pdf/2305.10613.pdf) simply treats entities and relations as plain numbers and ranks those numbers. I also doubt how well that works, because no code was given. Since models like LLaMA use a BPE tokenizer that treats each digit as a separate token, it is not easy to rank the scores of multi-digit numbers such as 15 or 234.
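To make the tokenizer point concrete, here is a small check one could run; the checkpoint path is just a placeholder for whatever LLaMA weights are available locally, and the exact token splits depend on the tokenizer:

```python
from transformers import AutoTokenizer

# Placeholder path -- substitute the LLaMA checkpoint you actually have.
tokenizer = AutoTokenizer.from_pretrained("path/to/llama-7b")

for entity_id in ["15", "234", "10613"]:
    tokens = tokenizer.tokenize(entity_id)
    print(entity_id, "->", tokens)
    # LLaMA's tokenizer typically splits numbers digit by digit,
    # e.g. "234" -> ["▁2", "3", "4"], so there is no single logit
    # to rank for a multi-digit entity ID.
```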
@ZifengDing
Hi, thank you for your insightful comments on our work.
It took 3h15min for LLaMA-7B to train on WN18RR and 38h29min on YAGO3-10 with an A100 GPU. It took 4h7min for LLaMA-13B to train on WN18RR and 49h53min on YAGO3-10 with an A100 GPU.
In our experiments, we found that the original LLaMA and ChatGLM, as well as ChatGPT and GPT-4, answer like your examples, whereas the fine-tuned LLaMA and ChatGLM give concise, exact answers that can be easily evaluated, as shown in Table 6. In addition, we manually labeled the answers as correct or wrong for FB13-100 and YAGO3-10-100.
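For reference, here is a minimal sketch of the difference between a lenient substring check and the strict exact-match check that concise fine-tuned answers allow (this is only an illustration with made-up strings, not our actual evaluation script):

```python
def lenient_hit(generated: str, ground_truth: str) -> bool:
    # Over-optimistic: counts a hit whenever the ground truth appears anywhere
    # in the output, even inside hedged text like "it is either X or Y".
    return ground_truth.lower() in generated.lower()

def strict_hit(generated: str, ground_truth: str) -> bool:
    # Stricter: requires the normalized output to be exactly the ground-truth
    # entity, which is what a fine-tuned model's concise answers allow.
    return generated.strip().lower() == ground_truth.lower()

print(lenient_hit("I think it should be either Paris or Lyon", "Paris"))  # True
print(strict_hit("I think it should be either Paris or Lyon", "Paris"))   # False
print(strict_hit("Paris", "Paris"))                                       # True
```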
For Hits@3/10 and MRR, perhaps we can design an effective prompt that elicits 3 or 10 ranked answers, e.g., "Please give three/ten possible entities, with the more reliable answers listed higher", and add few-shot examples to strengthen the prompt.
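As a rough sketch (not part of our released code), once the prompt yields a ranked list of candidate entities, Hits@k and MRR can be computed directly from it:

```python
def hits_at_k(ranked_answers: list[str], ground_truth: str, k: int) -> float:
    # 1.0 if the ground truth appears in the top-k of the ranked list, else 0.0.
    return float(ground_truth in ranked_answers[:k])

def reciprocal_rank(ranked_answers: list[str], ground_truth: str) -> float:
    # 1/rank of the ground truth, or 0.0 if it is absent from the list.
    for rank, answer in enumerate(ranked_answers, start=1):
        if answer == ground_truth:
            return 1.0 / rank
    return 0.0

# Example: the model was prompted for candidates, most reliable first.
ranked = ["Tokyo", "Kyoto", "Osaka", "Nagoya", "Sapporo"]
print(hits_at_k(ranked, "Osaka", 1))     # 0.0
print(hits_at_k(ranked, "Osaka", 3))     # 1.0
print(reciprocal_rank(ranked, "Osaka"))  # 0.333...
```

Averaging these per-query scores over the test set then gives the usual Hits@3/10 and MRR.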
Thanks for the impressive work. I was also trying to fine-tune LLaMA-7B for KG link prediction on my own dataset. I was using the Hugging Face Trainer and it cost me a huge amount of time to fine-tune. Since the training time is not reported in your paper, may I ask how long you spent training an entity prediction model? This matters because traditional KG completion models can achieve good performance at a much lower training cost; if LLMs cannot outperform them even with much larger time consumption, they would not be very practical.
Please correct me if I have misunderstood anything. Cheers