yeyupiaoling / Whisper-Finetune

Fine-tune the Whisper speech recognition model to support training without timestamp data, training with timestamp data, and training without speech data. Accelerate inference and support Web deployment, Windows desktop deployment, and Android deployment
Apache License 2.0
813 stars 129 forks

Fine-tuning fails with an error as soon as WhisperProcessor.from_pretrained is called #42

Closed lichq5 closed 8 months ago

lichq5 commented 9 months ago

I'm training on a single GPU, and it fails with an error as soon as it starts:

Traceback (most recent call last):
  File "/workspace/Whisper-Finetune-master/finetune.py", line 47, in <module>
    processor = WhisperProcessor.from_pretrained(args.base_model,
  File "/opt/conda/lib/python3.10/site-packages/transformers/processing_utils.py", line 228, in from_pretrained
    args = cls._get_arguments_from_pretrained(pretrained_model_name_or_path, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/processing_utils.py", line 272, in _get_arguments_from_pretrained
    args.append(attribute_class.from_pretrained(pretrained_model_name_or_path, **kwargs))
  File "/opt/conda/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2024, in from_pretrained
    return cls._from_pretrained(
  File "/opt/conda/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2249, in _from_pretrained
    init_kwargs[key] = added_tokens_map.get(init_kwargs[key], init_kwargs[key])
TypeError: unhashable type: 'dict'

What is going on here? Did I get something wrong?

yeyupiaoling commented 9 months ago

This may be because the model files you downloaded are incomplete, or are the wrong files.

lichq5 commented 9 months ago

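I downloaded all four model files from openai/whisper-small/ ([flax_model.msgpack], [model.safetensors], [pytorch_model.bin], [tf_model.h5]), and none of them work. Why is that? There is no MD5 to check against, so I can't verify whether the files differ from the originals, but the downloads completed without any errors.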
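For the checksum question above, a minimal sketch of computing a SHA-256 digest locally with Python's standard library. It assumes that a reference digest is available somewhere to compare against (for example on the model hosting page, if one is published there); the helper name `sha256_of_file` is hypothetical and not part of this repository.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks.

    Hypothetical helper for illustration; chunked reading keeps memory
    use small even for multi-gigabyte weight files.
    """
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling `sha256_of_file("model.safetensors")` and comparing the result with a published digest would show whether a download was corrupted. Note that a matching digest only confirms that one file is intact; it does not tell you whether other required files are missing.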

yeyupiaoling commented 8 months ago

@lichq5 It's not just those files; there are many other files as well.
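A rough sketch of how one might check a local model directory for the non-weight files that the processor and tokenizer read. The file list below is an assumption about what a Whisper checkpoint directory typically contains, not something stated in this thread; compare it against the actual file listing of the model repository you downloaded from. The name `missing_files` is hypothetical.

```python
from pathlib import Path

# ASSUMPTION: typical config/tokenizer files for a Whisper checkpoint.
# Verify against the real file listing of the model repository.
EXPECTED_FILES = [
    "config.json",
    "preprocessor_config.json",
    "tokenizer_config.json",
    "vocab.json",
    "merges.txt",
    "normalizer.json",
    "added_tokens.json",
    "special_tokens_map.json",
]


def missing_files(model_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from model_dir."""
    model_dir = Path(model_dir)
    return [name for name in expected if not (model_dir / name).is_file()]
```

For example, `missing_files("/path/to/whisper-small")` returning a non-empty list would suggest the directory holds only the weights and lacks some of the tokenizer or processor files.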

lichq5 commented 8 months ago

Now I get this error during training:

raise ValueError(
    "Asking to pad but the tokenizer does not have a padding token. "
    "Please select a token to use as pad_token (tokenizer.pad_token = tokenizer.eos_token e.g.) "
    "or add a new pad token via tokenizer.add_special_tokens({'pad_token': '[PAD]'})."
)

If I edit the source code by hand and add the line self.pad_token="[PAD]", will that affect the training results?

yeyupiaoling commented 8 months ago

That probably won't work. You still need to download the complete set of files so that the tokens can be read from them.