yuxiang-wu / gen-debiased-nli


MNLI dev data #4

Closed tomhosking closed 2 years ago

tomhosking commented 2 years ago

Hi Yuxiang,

What's the preferred format/source for the dev dataset when fine-tuning? If I point the trainer at the jsonl file downloaded from MNLI, it has the wrong column names:

['annotator_labels', 'genre', 'gold_label', 'pairID', 'promptID', 'sentence1', 'sentence1_binary_parse', 'sentence1_parse', 'sentence2', 'sentence2_binary_parse', 'sentence2_parse']

If I instead use the built-in data loader (i.e. specify --dev_data mnli), I get the following error:

Traceback (most recent call last):
  File "scripts/train_nli_synthetic.py", line 445, in <module>
    main()
  File "scripts/train_nli_synthetic.py", line 256, in main
    train_dataset, eval_dataset, _ = get_dataset_splits(args)
  File "scripts/train_nli_synthetic.py", line 149, in get_dataset_splits
    _, dev, _ = get_nli_dataset(args.data_dir, args.dev_data)
  File "/mnt/ext/phd/external/gen-debiased-nli/gen_debiased_nli/data/nli/__init__.py", line 57, in get_nli_dataset
    train_data, dev_data, test_data = load_f(data_dir)
  File "/mnt/ext/phd/external/gen-debiased-nli/gen_debiased_nli/data/nli/mnli.py", line 11, in load_mnli
    train_set = del_columns(raw_dataset["train"])
  File "/mnt/ext/phd/external/gen-debiased-nli/gen_debiased_nli/utils.py", line 66, in del_columns
    dataset.remove_columns_(
AttributeError: 'Dataset' object has no attribute 'remove_columns_'

What's the correct value for --dev_data?
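
For context, the AttributeError above is consistent with running a newer release of the Hugging Face `datasets` library, in which the in-place `Dataset.remove_columns_` method is no longer available and `Dataset.remove_columns` returns a new dataset instead. Below is a minimal sketch of a helper that copes with either behaviour. The name `del_columns_compat` and its `keep` parameter are hypothetical and are not the repository's actual `del_columns`; the column names to keep are assumed from the reply below.

```python
def del_columns_compat(dataset, keep=("premise", "hypothesis", "label")):
    """Return `dataset` with every column not listed in `keep` removed.

    Hypothetical helper for illustration; not the repository's `del_columns`.
    `dataset` is assumed to expose `column_names` plus either
    `remove_columns` (returns a new dataset) or the older in-place
    `remove_columns_`.
    """
    drop = [name for name in dataset.column_names if name not in keep]
    if not drop:
        # Nothing to remove, so hand the dataset back unchanged.
        return dataset
    if hasattr(dataset, "remove_columns"):
        # Newer API: the result must be captured, the original is untouched.
        return dataset.remove_columns(drop)
    # Older API: modifies the dataset in place and returns nothing useful.
    dataset.remove_columns_(drop)
    return dataset
```

The important detail is that the return value of `remove_columns` has to be assigned, since a call that discards the result leaves the original columns in place.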

yuxiang-wu commented 2 years ago

Hi Tom,

The dev set is expected to have the same format as the training data, so it should contain the columns premise, hypothesis, and label. I can fix this by the end of this week.
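
In the meantime, one possible workaround is to rewrite the downloaded MNLI jsonl file so that it has those three columns. The sketch below maps `sentence1`, `sentence2`, and `gold_label` from the column list quoted earlier. The function names are hypothetical, and the integer label ids (entailment=0, neutral=1, contradiction=2) are an assumption that should be checked against the training data. Rows whose `gold_label` is not one of the three classes (for example "-", used when annotators did not agree) are skipped.

```python
import json

# Assumed label ids; verify against the training data before relying on them.
LABEL_TO_ID = {"entailment": 0, "neutral": 1, "contradiction": 2}


def convert_mnli_record(record):
    """Map one raw MNLI record to premise/hypothesis/label, or None to skip it."""
    gold = record.get("gold_label")
    if gold not in LABEL_TO_ID:
        # No usable gold label (e.g. "-"), so the row is dropped.
        return None
    return {
        "premise": record["sentence1"],
        "hypothesis": record["sentence2"],
        "label": LABEL_TO_ID[gold],
    }


def convert_mnli_jsonl(src_path, dst_path):
    """Convert a raw MNLI jsonl file; returns the number of rows written."""
    kept = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            row = convert_mnli_record(json.loads(line))
            if row is None:
                continue
            dst.write(json.dumps(row) + "\n")
            kept += 1
    return kept
```

The converted file could then be passed where the trainer expects a dev file, assuming it reads jsonl with these column names.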