yzhangcs / parser

:rocket: State-of-the-art parsers for natural language.
https://parser.yzhang.site/
MIT License

How to load a custom trained BERT model from disk #91

Closed · MinionAttack closed this issue 2 years ago

MinionAttack commented 2 years ago

Hi,

I am trying to load a BERT model trained with SuPar, but I am not able to do it because it tries to download the model from Hugging Face and gives an error.

First I train a model using a Hugging Face identifier:

python -m supar.cmds.biaffine_sdp train --build --device 1 --conf config/Basque/berteus-base-cased.ini \
    --encoder bert --bert ixa-ehu/berteus-base-cased --unk '' \
    --train data/Corpus/Universal_Dependencies/Basque/BDT/train.conllu \
    --dev data/Corpus/Universal_Dependencies/Basque/BDT/dev.conllu \
    --test data/Corpus/Universal_Dependencies/Basque/BDT/test.conllu \
    --path models/Universal_Dependencies/Basque/BDT/Model_berteus-base-cased_1_baseline
---------------------+-------------------------------
Param                |             Value             
---------------------+-------------------------------
encoder              |              bert             
bert                 |   ixa-ehu/berteus-base-cased  
n_bert_layers        |               4               
mix_dropout          |              0.0              
bert_pooling         |              mean             
encoder_dropout      |              0.1              
n_edge_mlp           |              600              
n_label_mlp          |              600              
edge_mlp_dropout     |              0.25             
label_mlp_dropout    |              0.33             
interpolation        |              0.1              
lr                   |             5e-05             
lr_rate              |               1               
clip                 |              5.0              
min_freq             |               7               
fix_len              |               20              
epochs               |               3               
warmup               |              0.1              
batch_size           |              1000             
update_steps         |               1               
mode                 |             train             
path                 | models/Universal_Dependencies/Basque/BDT/Model_berteus-base-cased_1_baseline
device               |               1               
seed                 |               1               
threads              |               16              
local_rank           |               -1              
feat                 |              None             
build                |              True             
checkpoint           |             False             
max_len              |              None             
buckets              |               32              
train                | data/Corpus/Universal_Dependencies/Basque/BDT/train.conllu
dev                  | data/Corpus/Universal_Dependencies/Basque/BDT/dev.conllu
test                 | data/Corpus/Universal_Dependencies/Basque/BDT/test.conllu
embed                |     data/glove.6B.100d.txt    
unk                  |               ''              
n_embed              |              100              
n_embed_proj         |              125              
---------------------+-------------------------------

And the model is saved here:

models/Universal_Dependencies/Basque/BDT/Model_berteus-base-cased_1_baseline

So now I want to load that model and train it again to do some experiments:

python -m supar.cmds.biaffine_sdp train --build --device 1 --conf config/Basque/berteus-base-cased.ini \
    --encoder bert --bert models/Universal_Dependencies/Basque/BDT/Model_berteus-base-cased_1_baseline --unk '' \
    --train data/Corpus/Experiment/Basque/train.conllu \
    --dev data/Corpus/Experiment/Basque/dev.conllu \
    --test data/Corpus/Experiment/Basque/test.conllu \
    --path models/Experiment/Basque/Model_berteus-base-cased_1_finetuned
---------------------+-------------------------------
Param                |             Value             
---------------------+-------------------------------
encoder              |              bert             
bert                 | models/Universal_Dependencies/Basque/BDT/Model_berteus-base-cased_1_baseline
n_bert_layers        |               4               
mix_dropout          |              0.0              
bert_pooling         |              mean             
encoder_dropout      |              0.1              
n_edge_mlp           |              600              
n_label_mlp          |              600              
edge_mlp_dropout     |              0.25             
label_mlp_dropout    |              0.33             
interpolation        |              0.1              
lr                   |             5e-05             
lr_rate              |               1               
clip                 |              5.0              
min_freq             |               7               
fix_len              |               20              
epochs               |               3               
warmup               |              0.1              
batch_size           |              1000             
update_steps         |               1               
mode                 |             train             
path                 | models/Experiment/Basque/Model_berteus-base-cased_1_finetuned
device               |               1               
seed                 |               1               
threads              |               16              
local_rank           |               -1              
feat                 |              None             
build                |             False             
checkpoint           |             False             
max_len              |              None             
buckets              |               32              
train                | data/Corpus/Experiment/Basque/train.conllu
dev                  | data/Corpus/Experiment/Basque/dev.conllu
test                 | data/Corpus/Experiment/Basque/test.conllu
embed                |     data/glove.6B.100d.txt    
unk                  |               ''              
n_embed              |              100              
n_embed_proj         |              125              
---------------------+-------------------------------

But it always tries to download the model from Hugging Face instead of loading it from disk.

2022-01-25 15:55:34 INFO Building the fields
Traceback (most recent call last):
  File "/home/iago/Escritorio/SuPar_Pre-finetuning/venv/lib/python3.9/site-packages/transformers/configuration_utils.py", line 561, in get_config_dict
    config_dict = cls._dict_from_json_file(resolved_config_file)
  File "/home/iago/Escritorio/SuPar_Pre-finetuning/venv/lib/python3.9/site-packages/transformers/configuration_utils.py", line 650, in _dict_from_json_file
    text = reader.read()
  File "/usr/lib/python3.9/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 64: invalid start byte
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/iago/.local/share/JetBrains/IntelliJIdea2021.3/python/helpers/pydev/pydevd.py", line 1483, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/home/iago/.local/share/JetBrains/IntelliJIdea2021.3/python/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/home/iago/Escritorio/SuPar_Pre-finetuning/supar/cmds/biaffine_sdp.py", line 43, in <module>
    main()
  File "/home/iago/Escritorio/SuPar_Pre-finetuning/supar/cmds/biaffine_sdp.py", line 39, in main
    parse(parser)
  File "/home/iago/Escritorio/SuPar_Pre-finetuning/supar/cmds/cmd.py", line 28, in parse
    parser = Parser.load(**args) if args.checkpoint else Parser.build(**args)
  File "/home/iago/Escritorio/SuPar_Pre-finetuning/supar/parsers/sdp.py", line 233, in build
    t = AutoTokenizer.from_pretrained(args.bert)
  File "/home/iago/Escritorio/SuPar_Pre-finetuning/venv/lib/python3.9/site-packages/transformers/models/auto/tokenization_auto.py", line 470, in from_pretrained
    config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
  File "/home/iago/Escritorio/SuPar_Pre-finetuning/venv/lib/python3.9/site-packages/transformers/models/auto/configuration_auto.py", line 558, in from_pretrained
    config_dict, _ = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/home/iago/Escritorio/SuPar_Pre-finetuning/venv/lib/python3.9/site-packages/transformers/configuration_utils.py", line 583, in get_config_dict
    raise EnvironmentError(msg)
OSError: Couldn't reach server at 'models/Universal_Dependencies/Basque/BDT/Model_berteus-base-cased_1_baseline' to download configuration file or configuration file is not a valid JSON file. Please check network or file content here: models/Universal_Dependencies/Basque/BDT/Model_berteus-base-cased_1_baseline.

What am I doing wrong? I have tried with and without the --build parameter, but I get the same result.

How can I load a previously trained model that is on disk?

Regards.

yzhangcs commented 2 years ago

@MinionAttack Hi, have you checked that the model you provide is indeed loadable?

>>> from transformers import AutoModel
>>> AutoModel.from_pretrained('models/Universal_Dependencies/Basque/BDT/Model_berteus-base-cased_1_baseline')

MinionAttack commented 2 years ago

Hi @yzhangcs,

I'm using SuPar from the command line, but I have tried your code in a terminal with the full path of the model:

(venv) iago@zape:~/Escritorio/SuPar_Pre-finetuning$ python3
Python 3.9.10 (main, Jan 15 2022, 18:56:52) 
[GCC 7.5.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import supar
>>> from transformers import AutoModel
>>> AutoModel.from_pretrained('/home/iago/Escritorio/SuPar_Pre-finetuning/models/Universal_Dependencies/Basque/BDT/Model_berteus-base-cased_1_baseline')
Traceback (most recent call last):
  File "/home/iago/Escritorio/SuPar_Pre-finetuning/venv/lib/python3.9/site-packages/transformers/configuration_utils.py", line 561, in get_config_dict
    config_dict = cls._dict_from_json_file(resolved_config_file)
  File "/home/iago/Escritorio/SuPar_Pre-finetuning/venv/lib/python3.9/site-packages/transformers/configuration_utils.py", line 650, in _dict_from_json_file
    text = reader.read()
  File "/usr/lib/python3.9/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 64: invalid start byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/iago/Escritorio/SuPar_Pre-finetuning/venv/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 396, in from_pretrained
    config, kwargs = AutoConfig.from_pretrained(
  File "/home/iago/Escritorio/SuPar_Pre-finetuning/venv/lib/python3.9/site-packages/transformers/models/auto/configuration_auto.py", line 558, in from_pretrained
    config_dict, _ = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/home/iago/Escritorio/SuPar_Pre-finetuning/venv/lib/python3.9/site-packages/transformers/configuration_utils.py", line 583, in get_config_dict
    raise EnvironmentError(msg)
OSError: Couldn't reach server at '/home/iago/Escritorio/SuPar_Pre-finetuning/models/Universal_Dependencies/Basque/BDT/Model_berteus-base-cased_1_baseline' to download configuration file or configuration file is not a valid JSON file. Please check network or file content here: /home/iago/Escritorio/SuPar_Pre-finetuning/models/Universal_Dependencies/Basque/BDT/Model_berteus-base-cased_1_baseline.
>>> 

And I get the same error: it tries to download the model from Hugging Face.

For clarification, I'm using: BiaffineSemanticDependencyParser

yzhangcs commented 2 years ago

@MinionAttack Sorry, I misunderstood your question. So let me make some clarifications: what you intend to do is some sort of continual learning, right? In that case, all you need to do is remove the --build option, i.e., avoid building the model from scratch, and keep all other options the same as before. If you wish to reuse only the BERT params rather than the whole well-trained model, you will need to save the BERT params individually (rather than the whole model) somewhere and then point --bert to that path.
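
A minimal sketch of that second option, assuming SuPar's Parser.load API and that the underlying transformers model sits at parser.model.encoder.bert (the attribute path and the output directory below are assumptions and may differ between SuPar versions):

from supar import Parser
from transformers import AutoTokenizer

# Load the trained SuPar parser and pull out just its fine-tuned BERT weights.
parser = Parser.load('models/Universal_Dependencies/Basque/BDT/Model_berteus-base-cased_1_baseline')

# parser.model.encoder.bert is an assumption about where SuPar keeps the
# wrapped transformers model; check your SuPar version if it fails.
bert = parser.model.encoder.bert
bert.save_pretrained('models/Basque/berteus-finetuned-bert')

# Save the matching tokenizer alongside, so that --bert (which is passed to
# AutoTokenizer/AutoModel.from_pretrained) resolves against one directory.
AutoTokenizer.from_pretrained('ixa-ehu/berteus-base-cased').save_pretrained('models/Basque/berteus-finetuned-bert')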

MinionAttack commented 2 years ago

@yzhangcs Yes, maybe I did not explain it very well... sorry. What I'm trying to do is train an SDP model and save it, then load that model and train it again with different train/dev/test files to fine-tune it.

I have removed the --build parameter and I get the same error. Debugging the code, I have found the following:

supar/cmds/biaffine_sdp.py calls parse(parser) at line 39, which is defined in supar/cmds/cmd.py, and in that method (lines 27 to 29):

if args.mode == 'train':
    parser = Parser.load(**args) if args.checkpoint else Parser.build(**args)
    parser.train(**args)

Unless I specify the --checkpoint argument, the code will always build the model. Following the code: because I'm using a BiaffineSemanticDependencyParser instance, the build method is called from supar/parsers/sdp.py, and at line 221:

if os.path.exists(path) and not args.build:

The path variable holds models/Experiment/Basque/Model_berteus-base-cased_1_finetuned (the new model I want to save) instead of models/Universal_Dependencies/Basque/BDT/Model_berteus-base-cased_1_baseline, which is the one I'm trying to load. Because of that, the check fails, the code goes on, and the error is thrown at line 233:

t = AutoTokenizer.from_pretrained(args.bert)

Because of this, I'm a bit confused and I'm not quite sure I understand how SuPar works. Why does SuPar try to load the model that will be created during training instead of the one specified with the --bert parameter? And if SuPar loads from the path of the model to be created during training, where will that model be saved?

I apologise in advance if my head is a bit thick and I don't see how to use it properly.

yzhangcs commented 2 years ago

@MinionAttack Thank you for your thoughtful trials, which recall some details for me. It seems that my current code is not very well adapted to continual learning. --checkpoint is designed to restore an interrupted training process, rather than to train the model again. You might not be able to meet your needs with the code in supar/cmds/biaffine_sdp.py. I would recommend writing some scripts yourself to implement the following:

  1. initialize the model params, optimizer and schedulers (can be done by Parser.build).
  2. reload the params of the previously trained model (some hacky code to replace the initialized params of parser.model; see the sketch below).
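
A rough sketch of those two steps, using the paths from this thread. The exact build keyword arguments are an assumption based on supar/parsers/sdp.py, and the state dicts will only match if both models share the same architecture and vocab:

from supar.parsers import BiaffineSemanticDependencyParser

# 1. Build a fresh parser: initializes model params, optimizer and schedulers.
#    The keyword arguments mirror the command-line options used above.
parser = BiaffineSemanticDependencyParser.build(
    path='models/Experiment/Basque/Model_berteus-base-cased_1_finetuned',
    encoder='bert',
    bert='ixa-ehu/berteus-base-cased',
    train='data/Corpus/Experiment/Basque/train.conllu',
    dev='data/Corpus/Experiment/Basque/dev.conllu',
    test='data/Corpus/Experiment/Basque/test.conllu')

# 2. Overwrite the freshly initialized params with the previously trained ones
#    (this only works if the embedding/label sizes line up between the two).
baseline = BiaffineSemanticDependencyParser.load(
    'models/Universal_Dependencies/Basque/BDT/Model_berteus-base-cased_1_baseline')
parser.model.load_state_dict(baseline.model.state_dict())

# Then continue training on the new data.
parser.train(train='data/Corpus/Experiment/Basque/train.conllu',
             dev='data/Corpus/Experiment/Basque/dev.conllu',
             test='data/Corpus/Experiment/Basque/test.conllu')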

I'm sorry if some of my poorly thought out code has confused you.

MinionAttack commented 2 years ago

No problem, thanks for your recommendation! And your code is not bad; it's very good, as I was able to trace through it very quickly :)