Closed nabsabraham closed 2 months ago
Hi! Thanks a lot for catching this! We noticed a parsing error in processing the theoremqa_theorems dataset, which affects three data points in total. We will soon correct the affected instances and re-upload the dataset!
Let us know if you have further comments!
We have updated the theoremqa_theorems and theoremqa_questions subsets in https://huggingface.co/datasets/xlangai/BRIGHT. In both subsets, two instances were deleted and one was added.
Feel free to re-open the issue if you have more questions or comments!
Thanks for addressing this so quickly! I checked out theoremqa_theorems and it looks good! For theoremqa_questions, can you let me know if this is expected? There are two rows with the same query id, but the queries are different.
from datasets import load_dataset

# Load the query split for the theoremqa_questions subset of BRIGHT
dataset = 'theoremqa_questions'
data = load_dataset('xlangai/BRIGHT', 'examples')[dataset]
df = data.to_pandas()
print(df.shape)  # (196, 6)
print(df['id'].nunique())  # 195
# Two rows share this id; the assert fails because their queries differ
dups = df[df['id'] == 'TheoremQA_maxku/basic-electronics-2-1.json']
assert dups.iloc[0]['query'] == dups.iloc[1]['query'], 'queries have the same id but are different'
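To scan the whole split rather than a single id, a check along the following lines might help. This is only a minimal sketch: the helper name `find_conflicts` and the toy rows are made up for illustration and are not part of BRIGHT or the `datasets` library. It assumes each row is a dict with `id` and `query` keys, which is what iterating over the loaded split should yield.

```python
from collections import defaultdict


def find_conflicts(rows, key="id", value="query"):
    """Return {key: [distinct values]} for keys that map to more than one distinct value."""
    seen = defaultdict(list)
    for row in rows:
        # Record each distinct value once per key, preserving first-seen order
        if row[value] not in seen[row[key]]:
            seen[row[key]].append(row[value])
    return {k: v for k, v in seen.items() if len(v) > 1}


# Hypothetical toy rows standing in for the real split
rows = [
    {"id": "q1", "query": "first wording"},
    {"id": "q1", "query": "second wording"},
    {"id": "q2", "query": "another question"},
    {"id": "q3", "query": "another question"},
]

# Ids that appear with more than one distinct query
id_conflicts = find_conflicts(rows, key="id", value="query")
# Queries that appear under more than one distinct id
query_conflicts = find_conflicts(rows, key="query", value="id")
print(id_conflicts)
print(query_conflicts)
```

Swapping `key` and `value` lets the same helper surface repeated query texts as well as repeated ids. With the real data, `rows` could be replaced by the loaded split (for example `data` above).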
Sorry for the late reply! We will soon check the dataset! Stay tuned!
Hi @nabsabraham, thanks for catching this! It seems that the original MathInstruct dataset, which we sourced the theoremqa questions from, contained these two duplicate ids (with the same query), so we inherited them. They appear to have different queries in BRIGHT because we rewrote the same base question twice, but they are conceptually identical. Since there are no other questions with the same theorem, we will remove these two samples from the theoremqa_questions split. We will also update the Hugging Face dataset page.
Thanks for your interest in our dataset and please let us know if you have any other questions!
We have updated the data on Hugging Face! Feel free to check it out!
Thanks for this well-curated benchmark!
When looking through the theoremqa dataset, I've come across the query "Need equation" as a duplicate query, and I would argue there's not a lot of complex reasoning in this query to retrieve the gold document ids it retrieved. Is this a bug or expected? If I repeat this for the theoremqa_questions dataset, there are more duplicates of the query "Need equation", and here is one such example: