Question on FilCo Setting Reproduction

First of all, thanks for the amazing work. Following your instructions, I trained the generation model in the SILVER setting on the NQ dataset and obtained results very close to those reported in the paper. However, I ran into issues when reproducing the FilCo setting.

If I understand correctly, the following script is used to generate FilCo examples for the generation model:
```
python replace_context.py \
--dataset_path "./datasets/nq/base/test.json" \
--predset_path "./output/nq/mctx/filco-em_tuned-ft5.json" \
--output_path "./datasets/nq/mgen/em/test_em_top1_predict-ft5.json" \
--process_dataset nq
```
In line 51 of `filco/replace_context.py`, the code is `input_text = get_input_text(pex["input"], pex["output"])`, where `pex` is the per-instance example from `predset_path="./output/nq/mctx/filco-em_tuned-ft5.json"`. As I understand it, you use `pex["output"]` as the filtered context for the FilCo setting. But when I traced how `pex["output"]` is obtained, I found something odd. You construct `predset_path` with the following script:
```
python query.py \
--dataset_path "./datasets/nq/mctx/em/test_em_top1.json" \
--output_path "./output/nq/mctx/filco-em_tuned-ft5.json" \
--model_name_or_path "./checkpoints/nq-mctx_filco-em"
```
Based on line 107 of `filco/query.py`, `pex["output"]` is copied directly from the "output" field of the per-instance example in `dataset_path="./datasets/nq/mctx/em/test_em_top1.json"`. If I understand correctly, the "output" field in that file is the target (ground-truth) output used to test the context filtering model. In other words, it is an oracle filtered context, constructed with access to the gold answer.

In summary, in the current version of the provided code, `pex["output"]` is the oracle filtered context rather than the context produced by the filtering model. As a result, the generation model is tested on the oracle filtered context in the FilCo setting.
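One way to check this is to compare the two files directly. The snippet below is only a sketch: it assumes both files are JSON lists of per-instance dicts in the same order, each with an "output" field, and the helper names `load_examples` and `count_copied_outputs` are mine, not from the repository.

```python
import json


def load_examples(path):
    """Load a JSON file that is assumed to hold a list of example dicts."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def count_copied_outputs(dataset, predset):
    """Count prediction entries whose "output" equals the dataset "output".

    Assumes the two lists are aligned by index (an assumption, not verified
    against the repository). Returns (number_identical, number_compared).
    """
    total = min(len(dataset), len(predset))
    copied = sum(
        1
        for dex, pex in zip(dataset, predset)
        if dex.get("output") == pex.get("output")
    )
    return copied, total


if __name__ == "__main__":
    dataset = load_examples("./datasets/nq/mctx/em/test_em_top1.json")
    predset = load_examples("./output/nq/mctx/filco-em_tuned-ft5.json")
    copied, total = count_copied_outputs(dataset, predset)
    print(f"{copied}/{total} prediction entries reuse the dataset output")
```

If every entry matches, that would be consistent with `pex["output"]` being the target output rather than a model prediction.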

I may have misunderstood the code, so could you please clarify?

If I replace line 51 of `filco/replace_context.py` with `input_text = get_input_text(pex["input"], pex["pred_answers"][0])`, the generation model receives the context predicted by the trained context filtering model, which matches the FilCo setting. However, performance then drops to EM=0.3485 when I use top-1 passages on the NQ dataset. For reference, in the SILVER setting the model obtains EM=0.4424.
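To make the two variants easy to switch between, the choice of context could be isolated in a small helper. This is a sketch, not the repository's code: `select_filtered_context` and its `use_model_prediction` flag are hypothetical, and it assumes each prediction entry carries an "output" string and a "pred_answers" list, as described above.

```python
def select_filtered_context(pex, use_model_prediction=True):
    """Pick the context passed to the generation model for one example.

    use_model_prediction=True  -> first model-predicted context (FilCo setting)
    use_model_prediction=False -> the "output" field (oracle context, as argued above)
    """
    if not use_model_prediction:
        return pex["output"]
    preds = pex.get("pred_answers") or []
    if not preds:
        # Fail loudly instead of silently falling back to the oracle context.
        raise ValueError("example has no entries in 'pred_answers'")
    return preds[0]
```

Line 51 would then read `input_text = get_input_text(pex["input"], select_filtered_context(pex))`, where `get_input_text` is the existing function in `replace_context.py`.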
