young-geng / EasyLM

Large language models (LLMs) made easy: EasyLM is a one-stop solution for pre-training, finetuning, evaluating, and serving LLMs in JAX/Flax.
Apache License 2.0

why 'LLaMATokenizer' object has no attribute 'sp_model'? #101

Open zepen opened 11 months ago

zepen commented 11 months ago

When I run a command like this:

python -m EasyLM.models.llama.llama_train \
--total_steps=10 \
--save_model_freq=10 \
--optimizer.adamw_optimizer.lr_warmup_steps=1 \
--train_dataset.json_dataset.path='/home/ec2-user/workplace/EasyLM/dataset/' \
--train_dataset.json_dataset.seq_length=1024 \
--load_checkpoint='params::/home/ec2-user/workplace/EasyLM/open_llama_7b_v2_easylm' \
--tokenizer.vocab_file='/home/ec2-user/workplace/EasyLM/open_llama_7b_v2_easylm/tokenizer.model' \
--logger.output_dir=checkpoint/  \
--mesh_dim='1,4,2' \
--load_llama_config='7b' \
--train_dataset.type='json' \
--train_dataset.text_processor.fields='text' \
--optimizer.type='adamw' \
--optimizer.accumulate_gradient_steps=1 \
--optimizer.adamw_optimizer.lr=0.002 \
--optimizer.adamw_optimizer.end_lr=0.002 \
--optimizer.adamw_optimizer.lr_decay_steps=100000000 \
--optimizer.adamw_optimizer.weight_decay=0.001 \
--optimizer.adamw_optimizer.multiply_by_parameter_scale=True \
--optimizer.adamw_optimizer.bf16_momentum=True

The logs are as follows:

wandb: Tracking run with wandb version 0.15.12
wandb: W&B syncing is set to `offline` in this directory.  
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
2023-10-24 13:53:33.481541: W external/xla/xla/service/gpu/nvptx_compiler.cc:673] The NVIDIA driver's CUDA version is 12.0 which is older than the ptxas CUDA version (12.3.52). Because the driver is older than the ptxas version, XLA is disabling parallel compilation, which may slow down compilation. You should update your NVIDIA driver or use the NVIDIA-provided CUDA forward compatibility packages.
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/ec2-user/workplace/EasyLM/EasyLM/models/llama/llama_train.py", line 267, in <module>
    mlxu.run(main)
  File "/home/ec2-user/miniconda3/lib/python3.11/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/home/ec2-user/miniconda3/lib/python3.11/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
             ^^^^^^^^^^
  File "/home/ec2-user/workplace/EasyLM/EasyLM/models/llama/llama_train.py", line 64, in main
    tokenizer = LLaMAConfig.get_tokenizer(FLAGS.tokenizer)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ec2-user/workplace/EasyLM/EasyLM/models/llama/llama_model.py", line 293, in get_tokenizer
    tokenizer = LLaMATokenizer(
                ^^^^^^^^^^^^^^^
  File "/home/ec2-user/workplace/EasyLM/EasyLM/models/llama/llama_model.py", line 1140, in __init__
    super().__init__(bos_token=bos_token, eos_token=eos_token, unk_token=unk_token, **kwargs)
  File "/home/ec2-user/miniconda3/lib/python3.11/site-packages/transformers/tokenization_utils.py", line 366, in __init__
    self._add_tokens(self.all_special_tokens_extended, special_tokens=True)
  File "/home/ec2-user/miniconda3/lib/python3.11/site-packages/transformers/tokenization_utils.py", line 462, in _add_tokens
    current_vocab = self.get_vocab().copy()
                    ^^^^^^^^^^^^^^^^
  File "/home/ec2-user/workplace/EasyLM/EasyLM/models/llama/llama_model.py", line 1175, in get_vocab
    vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}
                                                             ^^^^^^^^^^^^^^^
  File "/home/ec2-user/workplace/EasyLM/EasyLM/models/llama/llama_model.py", line 1163, in vocab_size
    return self.sp_model.get_piece_size()
           ^^^^^^^^^^^^^
AttributeError: 'LLaMATokenizer' object has no attribute 'sp_model'
wandb: Waiting for W&B process to finish... (failed 1).
wandb: You can sync this run to the cloud by running:
wandb: wandb sync checkpoint/27fc482119cd4211965c651f185f0aa6/wandb/offline-run-20231024_135326-27fc482119cd4211965c651f185f0aa6
wandb: Find logs at: checkpoint/27fc482119cd4211965c651f185f0aa6/wandb/offline-run-20231024_135326-27fc482119cd4211965c651f185f0aa6/logs

It seems tokenizer.vocab_file is incorrect, but I don't know which file should be used.

zepen commented 11 months ago

No one has an answer?

juliensalinas commented 11 months ago

I fixed the problem by downgrading transformers to 4.33.0 (pip install -U transformers==4.33.0).
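For anyone who can't pin the version: the downgrade most likely works because transformers 4.34 changed the slow-tokenizer base class so that `PreTrainedTokenizer.__init__` calls `get_vocab()` (via `_add_tokens`), which in EasyLM's `LLaMATokenizer` reads `self.sp_model` before the subclass has assigned it. Here is a minimal, self-contained sketch of that pattern and the usual fix; the class names are stand-ins, not EasyLM's or transformers' actual code:

```python
class Base:
    """Stand-in for the transformers >= 4.34 base tokenizer:
    __init__ now calls get_vocab(), which dispatches to the subclass."""
    def __init__(self):
        self.vocab = self.get_vocab()  # runs subclass code early

    def get_vocab(self):
        raise NotImplementedError


class BrokenTokenizer(Base):
    """Assigns sp_model only AFTER super().__init__(), so the base
    class's get_vocab() call fails with the AttributeError above."""
    def __init__(self):
        super().__init__()        # triggers get_vocab() too early
        self.sp_model = {"a": 0}  # stand-in for the SentencePiece model

    def get_vocab(self):
        return dict(self.sp_model)


class FixedTokenizer(Base):
    """Loads sp_model BEFORE calling super().__init__()."""
    def __init__(self):
        self.sp_model = {"a": 0}  # stand-in for the SentencePiece model
        super().__init__()

    def get_vocab(self):
        return dict(self.sp_model)


try:
    BrokenTokenizer()
except AttributeError as e:
    print(e)  # 'BrokenTokenizer' object has no attribute 'sp_model'

print(FixedTokenizer().vocab)  # {'a': 0}
```

If I read the traceback correctly, applying the same reordering in EasyLM's `LLaMATokenizer.__init__` (loading the SentencePiece model before the `super().__init__(...)` call in llama_model.py) should make it work on newer transformers as well; this matches the fix Hugging Face made to its own LlamaTokenizer, though I haven't tested it against EasyLM.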

JhonatMiranda commented 7 months ago

It worked for me! Thank you @juliensalinas