What's the preferred format/source for the dev dataset when fine-tuning? If I point the trainer at the jsonl file downloaded from MNLI, it has the wrong column names.
If I instead use the built-in dataloader (i.e. specify --dev_data mnli), I get the following error:
Traceback (most recent call last):
  File "scripts/train_nli_synthetic.py", line 445, in <module>
    main()
  File "scripts/train_nli_synthetic.py", line 256, in main
    train_dataset, eval_dataset, _ = get_dataset_splits(args)
  File "scripts/train_nli_synthetic.py", line 149, in get_dataset_splits
    _, dev, _ = get_nli_dataset(args.data_dir, args.dev_data)
  File "/mnt/ext/phd/external/gen-debiased-nli/gen_debiased_nli/data/nli/__init__.py", line 57, in get_nli_dataset
    train_data, dev_data, test_data = load_f(data_dir)
  File "/mnt/ext/phd/external/gen-debiased-nli/gen_debiased_nli/data/nli/mnli.py", line 11, in load_mnli
    train_set = del_columns(raw_dataset["train"])
  File "/mnt/ext/phd/external/gen-debiased-nli/gen_debiased_nli/utils.py", line 66, in del_columns
    dataset.remove_columns_(
AttributeError: 'Dataset' object has no attribute 'remove_columns_'
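For anyone hitting the same AttributeError: it usually means the installed Hugging Face datasets release no longer has the in-place remove_columns_ method and only offers remove_columns, which returns a new dataset. Below is a minimal sketch of a version-tolerant helper. The name drop_columns and its columns argument are hypothetical; the repo's own del_columns in utils.py may take different arguments, so treat this as an illustration rather than the actual fix.

```python
def drop_columns(dataset, columns):
    """Return `dataset` without `columns`, whichever removal method it offers.

    Hypothetical helper, not the repo's del_columns: it only assumes the object
    exposes either remove_columns (returns a new dataset) or the older
    in-place remove_columns_.
    """
    if hasattr(dataset, "remove_columns"):
        # Newer API: the dataset is not modified, a new one is returned.
        return dataset.remove_columns(columns)
    # Older API: the dataset is modified in place and nothing is returned.
    dataset.remove_columns_(columns)
    return dataset
```

Because the newer method is not in-place, the caller has to keep the return value, e.g. train_set = drop_columns(raw_dataset["train"], ["idx"]), where "idx" is only an example column name.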
The dev set is expected to have the same format as the training data, so it should contain the columns premise, hypothesis, and label. I can fix this by the end of this week.
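Until that fix lands, one possible workaround is to rewrite the downloaded MNLI jsonl into those three columns. The sketch below assumes the standard MultiNLI field names sentence1, sentence2, and gold_label. The function name convert_mnli_jsonl and the integer label mapping are assumptions, so check the mapping against whatever label ids the training data uses.

```python
import json

# Assumed label ids; verify against the labels in your training data.
LABEL_TO_ID = {"entailment": 0, "neutral": 1, "contradiction": 2}


def convert_mnli_jsonl(src_path, dst_path):
    """Write a copy of an MNLI jsonl file with premise/hypothesis/label columns.

    Rows whose gold_label is not in LABEL_TO_ID (MultiNLI uses "-" when the
    annotators reached no consensus) are skipped. Returns the number of rows kept.
    """
    kept = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue
            row = json.loads(line)
            label = row.get("gold_label")
            if label not in LABEL_TO_ID:
                continue
            out = {
                "premise": row["sentence1"],
                "hypothesis": row["sentence2"],
                "label": LABEL_TO_ID[label],
            }
            dst.write(json.dumps(out) + "\n")
            kept += 1
    return kept
```

The output file can then be passed to the trainer in place of the raw download. Whether the trainer expects integer or string labels is not stated in this thread, so adjust the label field if needed.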
Hi Yuxiang,
What's the correct value for --dev_data?