Closed — tpritsky closed this issue 4 months ago
Hi!
For the first question, the tokenizer adds a start token and an end token by default when you tokenize a sequence. If you convert the input ids back into tokens, the list looks like:
["<cls>", <sequence tokens>, "<eos>"]
In ESM models these tokens carry no meaning of their own; they are there to align with the BERT-style training strategy. Also, the embedding of the "<cls>" token is commonly used as a sequence-level representation.
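To see the two extra tokens concretely, here is a minimal sketch using the huggingface tokenizer; the checkpoint name facebook/esm2_t12_35M_UR50D (the 35M ESM-2 model) is an assumption for illustration, and any ESM-2 checkpoint should behave the same:

```python
from transformers import AutoTokenizer

# Assumed checkpoint name for illustration; swap in the model you actually use.
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t12_35M_UR50D")

sequence = "MKTAY"  # 5 residues
inputs = tokenizer(sequence, return_tensors="pt")

# Convert the input ids back into tokens to see where the extra length comes from
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print(tokens)  # ['<cls>', 'M', 'K', 'T', 'A', 'Y', '<eos>'] -> 7 tokens for 5 residues
```

The sequence of length 5 becomes 7 input ids: one per residue, plus "<cls>" at the front and "<eos>" at the end.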
For the second question, the way I generate fixed-length embeddings is actually already based on the huggingface model. If you check the function, you will see that I just add a keyword argument to the inputs:
So you can adjust the function to obtain embeddings from other models.
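As a sketch of what "adjusting the function" could look like with a plain huggingface model: the keyword argument output_hidden_states=True is a standard transformers option, and the mean-pooling over residue positions shown here is one common (assumed, not the author's exact) way to collapse per-token states into a fixed-length vector. The checkpoint name is again an assumption:

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Assumed checkpoint for illustration (the 35M ESM-2 model, hidden size 480)
model_name = "facebook/esm2_t12_35M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

sequence = "MKTAYIAKQR"
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    # output_hidden_states=True also returns the per-layer hidden states
    outputs = model(**inputs, output_hidden_states=True)

last_hidden = outputs.last_hidden_state        # shape [1, len(sequence) + 2, hidden_dim]
# Drop <cls> and <eos>, then mean-pool over residues for a fixed-length embedding
embedding = last_hidden[0, 1:-1].mean(dim=0)   # shape [hidden_dim]
print(embedding.shape)
```

Mean-pooling is just one choice; taking the "<cls>" position (last_hidden[0, 0]) is another common option.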
Hope the answers above resolve your questions, and let me know if you have any others!
Thanks for your answer! This makes sense. As a sanity check, the embedding vector dimension is 480?
For the 35M model the embedding vector dimension is 480, and for the 650M model it is 1280 :)
Thanks! One other point of feedback: I was facing out-of-memory errors (on an A100 GPU) until I added 'with torch.no_grad():' to the get_hidden_states function. I'm sure you considered this, but adding it here in case it helps anyone :)
Thank you for pointing that out! We did consider this problem. We actually suggest using 'with torch.no_grad():' manually at the call site instead of writing it inside the function, e.g.:
with torch.no_grad():
model.get_hidden_states()
This is more flexible across different situations, i.e., whether or not you need the resulting embeddings to participate in gradient computation.
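The difference the two call styles make can be shown with a tiny self-contained PyTorch example (a stand-in nn.Linear model, not the actual ESM model): wrapping the forward pass in torch.no_grad() skips building the autograd graph, which is what saves memory at inference time, while the plain call keeps gradients available for fine-tuning:

```python
import torch
import torch.nn as nn

# Stand-in model for illustration; the same pattern applies to any forward pass
model = nn.Linear(4, 3)
x = torch.randn(2, 4)

# Inference: no autograd graph is built, so activations need not be kept around
with torch.no_grad():
    h = model(x)
print(h.requires_grad)   # False -> no graph, lower memory use

# Training / fine-tuning: call without no_grad so gradients can flow
h2 = model(x)
print(h2.requires_grad)  # True -> graph is retained for backprop
```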
Thanks for this impressive solution!
When I run the tokenizer on an input sequence, there are always two additional elements added to the tokenizer output. Why is this and what do the values represent?
For example:
print(len(sequence)) -> 5
inputs = tokenizer(sequence, return_tensors="pt")
print(inputs['input_ids'].size()) -> torch.Size([1, 7])
Additionally, I'm trying to generate a fixed length sequence embedding. I saw you answered how to do this with the ESM model, but is there a way to do so with the huggingface model?
Thanks for your help!