yakir-yehuda / InterrogateLLM


Question Regarding Evaluation Methodology in Books Dataset #2

Closed c7785812 closed 2 months ago

c7785812 commented 2 months ago

Hello again, I hope this message finds you well.

In your paper, you describe the evaluation methodology for the Books dataset as checking whether the elements (author name and release year) in the generated answer match the ground truth. However, in the code, the answer_args for the Books dataset also include the publisher. Could you please explain why the publisher information is included in answer_args, given that the paper does not mention it being used in the evaluation?

Additionally, the books_answer_heuristic function in the code uses a threshold of 3 for matching elements in the answer. The paper does not explain this choice; a threshold of 2 would seem more consistent with the two elements (author name and release year) it mentions. Could you provide some insight into the rationale behind setting the threshold to 3?
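
For context, here is a rough sketch of the pattern I am asking about. This is my own simplified paraphrase, not the repository's exact code: the function name matches the repo, but the signature and matching logic below are assumptions on my part.

```python
# Simplified paraphrase (assumed), not the repository's implementation:
# count how many of the expected answer elements appear in the generated
# answer and compare that count against a matching threshold.

def books_answer_heuristic(answer: str, author: str, year: str, publisher: str,
                           threshold: int = 3) -> bool:
    """Return True if at least `threshold` expected elements appear in the answer."""
    answer_lower = answer.lower()
    elements = [author, year, publisher]
    matches = sum(1 for e in elements if e and str(e).lower() in answer_lower)
    return matches >= threshold
```

With a default threshold of 3, all three elements (including the publisher) would need to appear, which is what prompted my question.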

Understanding these choices would be very helpful for my own research, as I am looking to replicate and build upon the methods used in your study.

Thank you very much for your time and assistance. I am looking forward to your response.

yakir-yehuda commented 2 months ago

Hi,

We initially considered including the publisher information in the evaluation, but decided to focus on just two elements: the author name and the release year, as mentioned in the paper.

Regarding the threshold in the books_answer_heuristic function: while the code sets a default matching threshold of 3, that default is not used in our evaluation. In the evaluation script (eval_results.py), we explicitly set heuristic_thresholds to 2 for books, in line with the two elements we evaluate (author name and release year). So the threshold of 3 is never applied in practice, and the effective matching threshold remains 2.
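
To illustrate the point, a minimal sketch of the override (assumed structure, not the verbatim eval_results.py code; the example answer and matching helper are mine):

```python
# Assumed illustration of how the per-dataset threshold overrides the default of 3.
heuristic_thresholds = {'books': 2}  # author name + release year, as in the paper

def count_matches(answer: str, expected_elements: list) -> int:
    """Count how many expected elements appear (case-insensitively) in the answer."""
    return sum(1 for e in expected_elements if str(e).lower() in answer.lower())

answer = "The Road was written by Cormac McCarthy and published in 2006 by Knopf."
expected = ["Cormac McCarthy", "2006", "Knopf"]  # publisher is in answer_args but not required
is_correct = count_matches(answer, expected) >= heuristic_thresholds['books']
print(is_correct)  # True: author name and release year alone already meet the threshold of 2
```

In other words, the publisher may be carried along in answer_args, but with the threshold set to 2 it does not need to match for an answer to be counted as correct.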

Thanks