wonjune-kang / llm-speech-summarization

Prompting Large Language Models with Audio for General-Purpose Speech Summarization
https://arxiv.org/abs/2406.05968
MIT License

Mismatched Tensor Sizes #3

Closed tylersbarlow closed 1 month ago

tylersbarlow commented 1 month ago

Hello! I arrived here from your Interspeech paper and am very impressed by what you have been able to do with speech summarization!

I have a question about the code. I am trying to train a model from scratch, essentially following the code provided in the GitHub. However, I am curious about the self.llm.generate() function that is called in both trainer.py (during validation) and inference.py (line 441 in trainer.py and line 100 in inference.py). Each time it runs, I get an error saying the tensor sizes don't match. The input_ids argument is set to None, but I think the function expects the IDs. What was the reason for passing in None instead of the IDs, and do you know of a way around the error so that I can run inference?

I am fairly inexperienced in this area, but think the concept is very interesting, which is why I was looking to reproduce it. Any explanations would be greatly appreciated!

Thanks!

Here is the error for reference:

```
RuntimeError: The size of tensor a (0) must match the size of tensor b (94) at non-singleton dimension 2
```

wonjune-kang commented 1 month ago

There are several ways of passing in inputs to HuggingFace's generate function. The standard way is to pass in input_ids, which is what you'd normally do if you're using the LLM's tokenizer; this feeds in token IDs, which are then internally converted to embeddings before being fed into the model. But you can also pass in inputs_embeds instead, where you feed in the embeddings directly (see the Llama model code in the transformers repo).

We need to feed in inputs_embeds rather than input_ids because the audio tokens produced by the audio encoder don't correspond to any discrete token ID from the LLM's tokenizer. Note that input_ids should be set to None if inputs_embeds is passed in this way.
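
For illustration, here's a minimal sketch of both paths (the model name is just a placeholder, not necessarily the one this repo uses):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

enc = tokenizer("Summarize the following speech:", return_tensors="pt")

# Standard path: pass token IDs; the model looks up their embeddings internally.
out = model.generate(input_ids=enc.input_ids, max_new_tokens=20)

# Embeddings path: compute the embeddings yourself (or, as in this repo,
# produce them with an audio encoder) and pass them directly,
# leaving input_ids as None.
embeds = model.get_input_embeddings()(enc.input_ids)  # [1, seq_len, hidden_size]
out = model.generate(
    input_ids=None,                  # must be None when passing inputs_embeds
    inputs_embeds=embeds,
    attention_mask=enc.attention_mask,
    max_new_tokens=20,
)
```

One caveat to be aware of: for decoder-only models, recent transformers versions return only the newly generated tokens (without the prompt) when generating from inputs_embeds, since the prompt has no token IDs to prepend.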

Could you provide the exact code/command(s) you ran to produce the tensor mismatch error you got?

tylersbarlow commented 1 month ago

Thanks for the explanation! I initially preprocessed the data by running:

```
python preprocess_data/preprocess.py
```

exactly as the README says. The only thing I changed in preprocess.py was the gpu_idx on line 25 (from 1 to 0) and the file save path.

Then I ran:

```bash
RUN_NAME="full_training_run"
CONFIG_FILE="config/config_full.yaml"
GPU_IDX=0

python -u train.py -c $CONFIG_FILE -g $GPU_IDX -n $RUN_NAME
```

In the config file, I changed the base path, and I put only "librispeech_train.clean.100_preprocessed.hf" in train_set and only "librispeech_validation.clean_preprocessed.hf" in val_set, to avoid processing the entire dataset until I can get the code working. I changed nothing in train.py, but I did make some edits in trainer.py. I had to add

```python
audio_attention_mask = audio_attention_mask.to(self.device)
text_attention_mask = text_attention_mask.to(self.device)
```

after line 228; otherwise I would get an error that some tensors were on the CPU and some on the GPU (this addition allowed the code to keep running; without it, I couldn't even make it to epoch initialization). I moved the tensors back to the CPU using

audio_attention_mask = audio_attention_mask.to("cpu") and text_attention_mask = text_attention_mask.to("cpu")

after they were used in their respective function calls. To be transparent, I also uncommented the debugging lines to make things run faster. Other than that, everything is the same as in your GitHub. The code runs all the way to the validate function, where, when self.llm.generate() is called on line 440, I get the error from my first message. If I initialize a tensor of random integers with the same sequence length as inputs_embeds and pass that as input_ids, the code runs without error, but it just returns gibberish.
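
Roughly, that workaround looked like this (paraphrased from memory; generation kwargs omitted):

```python
import torch

# Replacing the original generate call in validate(): random token IDs with
# the same batch size and sequence length as inputs_embeds. This runs without
# the shape error, but the output is gibberish since the IDs carry no
# information about the audio.
batch_size, seq_len, _ = inputs_embeds.shape  # [1, 94, 3072] in my case
random_ids = torch.randint(
    0, self.llm.config.vocab_size, (batch_size, seq_len), device=self.device
)
output = self.llm.generate(input_ids=random_ids)
```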

Let me know if you have any other questions. Thanks for the help!

wonjune-kang commented 1 month ago

I'll try to take a look at this ASAP (probably over the weekend); most likely, there's a batch or tensor dimension mismatch between how the audio is loaded by the dataloader and how the generate function expects the input. I had to clean up the preprocessing code quite a bit before I committed it, and I might have introduced some errors there -- it's probably not being preprocessed in exactly the same way as I did it originally.

In the meantime, could you try printing the dimensions of inputs_embeds and seeing what they are (see the snippet below)? I believe they should be something like [batch_size, seq_length, 3072]. Also, have you tried running inference using only the pre-trained audio encoder, and do you run into the same error there?
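
Something like this right before the generate call would do it:

```python
# Quick sanity check just before the self.llm.generate() call in validate():
print(inputs_embeds.shape)   # expect torch.Size([batch_size, seq_length, 3072])
print(inputs_embeds.device)  # should match the device the LLM is on
```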

tylersbarlow commented 1 month ago

Great, that would be super helpful. The dimensions are [1, 94, 3072].

Unfortunately, I don't have access to Google Drive, so I can't download the pre-trained checkpoints; that's why I'm training from scratch.

wonjune-kang commented 1 month ago

I tried re-running preprocessing on a subset of the Librispeech data and running the training script, and I didn't run into any of the issues you described, including the device issue with the attention masks and the issue with the generate call. I'm also seeing inputs_embeds with dimensions [1, seq_length, 3072], and generate runs fine.

Are you running the exact dependency versions specified in requirements.txt? HuggingFace's transformers library is updated pretty frequently, so there's a chance that some expected inputs don't match if you're on a different version.
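
For a quick check against the pins in requirements.txt, something like this in the training environment:

```python
# Print installed versions to compare against the pins in requirements.txt.
import torch
import torchaudio
import transformers

print("torch:", torch.__version__)
print("torchaudio:", torchaudio.__version__)
print("transformers:", transformers.__version__)
```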

tylersbarlow commented 1 month ago

The dependencies are a good point to bring up. Again, I'm fairly new to all of this, so I'm not sure how much this would change things, but when I install from requirements.txt, I get an error that the versions:

```
torch==2.0.0+cu117
torchaudio==2.0.1+cu117
```

cannot be found, so I ended up just pip installing torch 2.0.0 and torchaudio 2.0.1 (see my note below). In addition, after installing the correct versions of everything else in requirements.txt, when I run train.py, I get an error that says

"LlamaTokenizer requires the SentencePiece library but it was not found in your environment."

so I pip installed sentencepiece as well. After all of that, though, it does seem to be working correctly, so something must have been wrong with my dependencies earlier. Sorry for the hassle, and thank you for the prompt response!
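
For reference, in case anyone else hits the same resolution error: I believe the +cu117 builds are hosted on PyTorch's own package index rather than PyPI, so something like this would probably have installed the pinned versions:

```bash
# Assumed fix: point pip at PyTorch's CUDA 11.7 wheel index.
pip install torch==2.0.0+cu117 torchaudio==2.0.1+cu117 \
    --index-url https://download.pytorch.org/whl/cu117
```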

wonjune-kang commented 1 month ago

Great, and thanks for pointing out the sentencepiece dependency that was missing from the requirements file; I've just updated it. Glad you were able to get things working!