xlang-ai / BRIGHT

BRIGHT: A Realistic and Challenging Benchmark for Reasoning-Intensive Retrieval
https://brightbenchmark.github.io/
Creative Commons Attribution 4.0 International
vague queries in theorem datasets #5

Closed nabsabraham closed 2 months ago

nabsabraham commented 2 months ago

Thanks for this well-curated benchmark!

When looking through the theoremqa dataset, I came across the query "Need equation", which appears as a duplicate query. I would argue that there is not a lot of complex reasoning involved in retrieving its gold document ids from this query. Is this a bug or expected?

from datasets import load_dataset
dataset = 'theoremqa_theorems' # / "theoremqa_questions"

data = load_dataset('xlangai/BRIGHT', 'examples')[dataset]
df = data.to_pandas()

corpus = load_dataset('xlangai/BRIGHT', 'documents')[dataset]
corpus_df = corpus.to_pandas()
corpus = dict(zip(corpus_df['id'], corpus_df['content']))

subset = df[df['query']=='Need equation']
row = subset.iloc[0]
print(f'''\nQUERY [{row['id']}]:\n''', row['query'])
gold_doc_ids = row['gold_ids']
print(f'\nDOCUMENT: [{gold_doc_ids[0]}]:\n', corpus[gold_doc_ids[0]])
print(f'\nREASONING:\n', {row['reasoning']})

If I repeat this for the theoremqa_questions dataset, there are more duplicates of the query "Need equation". Here is one such example:

QUERY [TheoremQA_maxku/signalprocessing3-Ztransform.json]:
 Need equation

DOCUMENT: [TheoremQA_maxku/signalprocessing6-Ztransform.json]:
 The difference equation of a causal system is $y[n]+0.5 y[n-1]=x[n]-x[n-2]$, where $y[n]$ is its output and $x[n]$ is its input. Is the system a FIR filter?
To determine if the system is a FIR filter, we need to check if the impulse response of the system is finite. 

Assuming the input is an impulse signal $x[n]=\delta[n]$, we can find the impulse response of the system by solving the difference equation with initial conditions $y[-1]=y[-2]=0$:

$y[0]+0.5y[-1]=1-0=1 \implies y[0]=1$

$y[1]+0.5y[0]=0-0=0 \implies y[1]= -0.5y[0]= -0.5$

$y[2]+0.5y[1]=0-1=-1 \implies y[2]= -0.5y[1]-1=0.25$

$y[3]+0.5y[2]=0-0=0 \implies y[3]= -0.5y[2]= -0.125$

$y[4]+0.5y[3]=0+1=1 \implies y[4]= -0.5y[3]+1=0.0625$

$\vdots$

We can see that the impulse response of the system is not finite, since it does not decay to zero as $n$ goes to infinity. Therefore, the system is not a FIR filter.

Therefore, the answer is False.

REASONING:
 {'z-transform'}
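
To list every repeated query at once rather than filtering for one known string, something like the following could be used. This is only a sketch on toy rows: the helper name `find_repeated_queries` and the ids `a.json`, `b.json`, `c.json` are made up for illustration, and the rows are assumed to be dicts with `id` and `query` keys, as in the `examples` split loaded above.

```python
from collections import defaultdict


def find_repeated_queries(rows):
    """Map each query text that occurs more than once to the ids that use it."""
    ids_by_query = defaultdict(list)
    for row in rows:
        # Strip whitespace so ' Need equation' and 'Need equation' are grouped together.
        ids_by_query[row["query"].strip()].append(row["id"])
    return {query: ids for query, ids in ids_by_query.items() if len(ids) > 1}


# Toy rows with hypothetical ids; real rows would come from the loaded dataset.
toy_rows = [
    {"id": "a.json", "query": "Need equation"},
    {"id": "b.json", "query": "Is the system a FIR filter?"},
    {"id": "c.json", "query": " Need equation"},
]

repeated = find_repeated_queries(toy_rows)
print(repeated)  # {'Need equation': ['a.json', 'c.json']}
```

Passing the loaded `data` object instead of `toy_rows` should work the same way, since iterating a Hugging Face dataset yields one dict per row.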

hongjin-su commented 2 months ago

Hi! Thanks a lot for catching this! We noticed a parsing error in the processing of the theoremqa_theorems dataset, which affects three data points in total. We will correct the affected instances and re-upload the dataset soon!

Let us know if you have further comments!

hongjin-su commented 2 months ago

We have updated the theoremqa_theorems and theoremqa_questions subsets at https://huggingface.co/datasets/xlangai/BRIGHT. In both subsets, two instances were deleted and one was added.

Feel free to re-open the issue if you have more questions or comments!

nabsabraham commented 2 months ago

Thanks for addressing this so quickly! I checked out theoremqa_theorems and it looks good! For theoremqa_questions, can you let me know if this is expected? There are two rows with the same query id, but the queries are different.

from datasets import load_dataset

dataset = 'theoremqa_questions'
data = load_dataset('xlangai/BRIGHT', 'examples')[dataset]
df = data.to_pandas() 
print(df.shape)  # (196, 6)

print(df['id'].nunique()) # 195

dups = df[df['id']=='TheoremQA_maxku/basic-electronics-2-1.json']

assert dups.iloc[0]['query'] == dups.iloc[1]['query'], 'queries have the same id but are different'
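
A more general version of this check, which reports every id attached to more than one distinct query instead of testing one known id, might look like the sketch below. The helper name `conflicting_ids` and the toy ids and queries are hypothetical; rows are again assumed to be dicts with `id` and `query` keys.

```python
from collections import defaultdict


def conflicting_ids(rows):
    """Return ids that are attached to more than one distinct query text."""
    queries_by_id = defaultdict(set)
    for row in rows:
        queries_by_id[row["id"]].add(row["query"])
    # Sort the query texts so the result is deterministic.
    return {
        row_id: sorted(queries)
        for row_id, queries in queries_by_id.items()
        if len(queries) > 1
    }


# Toy rows with hypothetical ids and queries.
toy_rows = [
    {"id": "q1.json", "query": "first rewrite of a question"},
    {"id": "q1.json", "query": "second rewrite of a question"},
    {"id": "q2.json", "query": "an unrelated question"},
]

conflicts = conflicting_ids(toy_rows)
print(conflicts)
```

An empty result means every id maps to exactly one query text.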

hongjin-su commented 2 months ago

Sorry for the late reply! We will soon check the dataset! Stay tuned!

howard-yen commented 2 months ago

Hi @nabsabraham, thanks for catching this! It seems that the original MathInstruct dataset, from which we sourced the theoremqa questions, contained these two duplicate ids (with the same query), so we inherited them. They appear to have different queries in BRIGHT because we rewrote the same base question, but they are conceptually identical. Since there are no other questions with the same theorem, we will remove these two samples from the theoremqa_questions split. We will also update the Hugging Face dataset page.

Thanks for your interest in our dataset and please let us know if you have any other questions!
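
For anyone working from an older local copy, the removal described above can be approximated by dropping every row whose id occurs more than once. This is only a sketch of that idea, not the script used to update the dataset; the helper name `drop_duplicated_ids` and the toy ids are hypothetical.

```python
from collections import Counter


def drop_duplicated_ids(rows):
    """Keep only rows whose id is unique; every copy of a repeated id is dropped."""
    counts = Counter(row["id"] for row in rows)
    return [row for row in rows if counts[row["id"]] == 1]


# Toy rows with hypothetical ids.
toy_rows = [
    {"id": "dup.json", "query": "first rewrite"},
    {"id": "dup.json", "query": "second rewrite"},
    {"id": "ok.json", "query": "a unique question"},
]

cleaned = drop_duplicated_ids(toy_rows)
print([row["id"] for row in cleaned])  # ['ok.json']
```

Dropping both copies, rather than keeping one, matches the decision above to remove the two samples entirely.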

hongjin-su commented 2 months ago

We have updated the data on Hugging Face! Feel free to check it out!