Open · KoichiYasuoka opened this issue 2 years ago
!test -d transformers-4.19.2 || git clone -b v4.19.2 --depth=1 https://github.com/huggingface/transformers transformers-4.19.2
!test -d JGLUE || ( git clone --depth=1 https://github.com/yahoojapan/JGLUE && cat JGLUE/fine-tuning/patch/transformers-4.9.2_jglue-1.0.0.patch | ( cd transformers-4.19.2 && patch -p1 ) )
!cd transformers-4.19.2 && pip install .
!pip install -r transformers-4.19.2/examples/pytorch/text-classification/requirements.txt
!pip install protobuf==3.19.1 tensorboard
import json
for f in ["train-v1.0.json","valid-v1.0.json"]:
  with open("JGLUE/datasets/jsquad-v1.0/"+f,"r",encoding="utf-8") as r:
    j=json.load(r)
  # Flatten the nested data/paragraphs/qas structure into one record per question,
  # the flat SQuAD-style format that run_qa.py expects.
  u=[]
  for d in j["data"]:
    for p in d["paragraphs"]:
      for q in p["qas"]:
        u.append({"id":q["id"],"title":d["title"],"context":p["context"],"question":q["question"],"answers":{"text":[x["text"] for x in q["answers"]],"answer_start":[x["answer_start"] for x in q["answers"]]}})
  with open(f,"w",encoding="utf-8") as w:
    json.dump({"data":u},w,ensure_ascii=False,indent=2)
!python transformers-4.19.2/examples/pytorch/question-answering/run_qa.py --model_name_or_path KoichiYasuoka/deberta-base-japanese-aozora --do_train --do_eval --max_seq_length 384 --learning_rate 5e-05 --num_train_epochs 3 --per_device_train_batch_size 16 --per_device_eval_batch_size 16 --output_dir ./output_jsquad2 --overwrite_output_dir --train_file train-v1.0.json --validation_file valid-v1.0.json --save_steps 5000 --warmup_ratio 0.1
I've just been trying the program above on Google Colaboratory, but I'm not sure whether the conversion is really suitable for JSQuAD. @tomohideshibata -san, does [SEP] in the jsquad-v1.0 files mean sep_token or not?
Thank you for trying JGLUE.
For the first comment, the latest version, v4.19.2, works. (We have updated the explanation of the huggingface versions via https://github.com/yahoojapan/JGLUE/commit/53e5ecd9dfa7bbe6d84f818d599bfb4393dd639d.)
For the second comment, we used examples/legacy/question-answering/run_squad.py because examples/pytorch/question-answering/run_qa.py supports only fast tokenizers (BertJapaneseTokenizer does not have a fast version). We will check whether run_qa.py works with JSQuAD.
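For reference, a minimal sketch (not part of JGLUE; the Tohoku model id is only an example) of how to see whether a checkpoint ships a fast tokenizer:

# Minimal sketch, not part of JGLUE: run_qa.py needs a fast tokenizer, and a quick
# check is the is_fast flag. Loading BertJapaneseTokenizer also requires fugashi
# and unidic-lite to be installed.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-whole-word-masking")
print(type(tok).__name__, tok.is_fast)  # expected: BertJapaneseTokenizer False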
Does [SEP] in the jsquad-v1.0 files mean sep_token or not?
Yes.
Thank you @tomohideshibata -san for confirming transformers v4.19.2. Here I realize that I need to replace [SEP] with another sep_token when I evaluate a model whose sep_token is not [SEP]. But... well... unless that sep_token consists of 5 characters, I should also change answer_start, shouldn't I? Umm...
I should change answer_start, shouldn't I?
Yes. In the current version, sep_token is hard-coded in the dataset. One way to solve this problem is to calculate answer_start in the evaluation script given the sep_token of the tokenizer in use. We will try this in the next version.
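As a rough illustration of that idea (a sketch only, not the planned JGLUE implementation): replace the literal [SEP] in each context with the model's sep_token and shift answer_start by the length difference for every separator that precedes the answer.

# Illustrative sketch only, not the JGLUE evaluation script: adjust answer_start
# after replacing the literal "[SEP]" in a JSQuAD-style context with another sep_token.
def replace_sep(context, answer_start, sep_token):
  old = "[SEP]"
  # shift the offset by the length difference for every "[SEP]" before the answer
  shift = (len(sep_token) - len(old)) * context.count(old, 0, answer_start)
  return context.replace(old, sep_token), answer_start + shift

# toy "title [SEP] paragraph" context, not a real JSQuAD record
ctx = "富士山 [SEP] 富士山は日本一高い山である。"
new_ctx, new_start = replace_sep(ctx, ctx.index("日本一"), "</s>")
assert new_ctx[new_start:new_start + 3] == "日本一"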
Thank you @tomohideshibata -san for the information about [SEP]. Well, I've just made a tentative patch https://github.com/KoichiYasuoka/JGLUE/blob/main/fine-tuning/patch/transformers-4.19.2_jglue-1.0.0.patch for transformers v4.19.2, where I included jsquad_metrics.py instead of changing the original squad_metrics.py. But I couldn't include jsquad.py, since I couldn't find a proper way to force [SEP] as sep_token in squad_convert_example_to_features() and its neighbors...
We encountered a similar problem. examples/legacy/question-answering/run_squad.py does not work well with fast tokenizers; our model cannot run on that script even with use_fast=False. So we tested examples/pytorch/question-answering/run_qa.py: multilingual models and the Waseda RoBERTa run fine on it, but the Tohoku BERT tokenizer is not supported. The result for nlp-waseda/roberta-base-japanese is below (without any parameter tuning); it seems to work fine as long as we can solve the tokenizer problem.
| EM | F1 |
|---|---|
| 0.855 | 0.910 |
Thanks for reporting your results. We are also going to test run_qa.py.
I also tried run_qa.py (with trainer_qa.py and utils_qa.py) in transformers v4.19.2, but somehow an error like this occurred...
File "run_qa.py", line 661, in <module>
main()
File "run_qa.py", line 337, in main
answer_column_name = "answers" if "answers" in column_names else column_names[2]
IndexError: list index out of range
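The error is probably because the raw nested JSQuAD files do not expose the flat question/context/answers columns that run_qa.py expects. A quick way to check (my own sketch, assuming the datasets library and the JGLUE checkout from above):

# Rough check, not from the thread: the raw nested JSQuAD JSON only exposes its
# top-level columns, so run_qa.py's fallback column_names[2] goes out of range.
from datasets import load_dataset

ds = load_dataset("json",
                  data_files={"validation": "JGLUE/datasets/jsquad-v1.1/valid-v1.1.json"},
                  field="data")
print(ds["validation"].column_names)  # likely ['title', 'paragraphs'] -- no 'answers' column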
Hi @kaisugi -san, I needed some kind of conversion for run_qa.py. My tentative script on Google Colaboratory is below:
!test -d transformers-4.19.2 || git clone -b v4.19.2 --depth=1 https://github.com/huggingface/transformers transformers-4.19.2
!test -d JGLUE || ( git clone --depth=1 https://github.com/yahoojapan/JGLUE && cat JGLUE/fine-tuning/patch/transformers-4.9.2_jglue-1.1.0.patch | ( cd transformers-4.19.2 && patch -p1 ) )
!cd transformers-4.19.2 && pip install .
!pip install -r transformers-4.19.2/examples/pytorch/text-classification/requirements.txt
!pip install protobuf==3.19.1 tensorboard
import json
for f in ["train-v1.1.json","valid-v1.1.json"]:
  with open("JGLUE/datasets/jsquad-v1.1/"+f,"r",encoding="utf-8") as r:
    j=json.load(r)
  # Same flattening conversion as above, now for the JSQuAD v1.1 files.
  u=[]
  for d in j["data"]:
    for p in d["paragraphs"]:
      for q in p["qas"]:
        u.append({"id":q["id"],"title":d["title"],"context":p["context"],"question":q["question"],"answers":{"text":[x["text"] for x in q["answers"]],"answer_start":[x["answer_start"] for x in q["answers"]]}})
  with open(f,"w",encoding="utf-8") as w:
    json.dump({"data":u},w,ensure_ascii=False,indent=2)
!python transformers-4.19.2/examples/pytorch/question-answering/run_qa.py --model_name_or_path KoichiYasuoka/deberta-base-japanese-aozora --do_train --do_eval --max_seq_length 384 --learning_rate 5e-05 --num_train_epochs 3 --per_device_train_batch_size 16 --per_device_eval_batch_size 16 --output_dir ./output_jsquad2 --overwrite_output_dir --train_file train-v1.1.json --validation_file valid-v1.1.json --save_steps 5000 --warmup_ratio 0.1
@KoichiYasuoka I confirmed that your patch script worked properly. Thanks!
Thank you for releasing JGLUE, but I could not evaluate my deberta-base-japanese-aozora. There seem to be two problems:

1. DeBERTaV2ForMultipleChoice requires transformers v4.19.0 or later, but JGLUE requires v4.9.2
2. Fast tokenizers (DeBERTaV2TokenizerFast) are not supported on JSQuAD with --use_fast_tokenizer

I tried to force v4.19.2 to work around these problems, but I could not resolve the latter. Please see the details in my diary (written in Japanese). Do you have any idea?
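As a rough check of the first problem (an illustration of mine; note that the class is spelled DebertaV2ForMultipleChoice in transformers itself):

# Illustration only: the DeBERTa-V2 multiple-choice head exists in transformers
# v4.19.0 and later, so this import fails under v4.9.2.
import transformers
print(transformers.__version__)
try:
  from transformers import DebertaV2ForMultipleChoice
  print("DebertaV2ForMultipleChoice is available")
except ImportError:
  print("DebertaV2ForMultipleChoice is not available in this transformers version")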