yya518 / FinBERT

A Pretrained BERT Model for Financial Communications. https://arxiv.org/abs/2006.08097
Apache License 2.0

RuntimeError: The size of tensor a (538) must match the size of tensor b (512) at non-singleton dimension 1 #31

Open j4ffle opened 2 years ago

j4ffle commented 2 years ago

I'm parsing conference calls and run into this error occasionally. I used NLTK to split the text into sentences and then pass those sentences to the classifier, following your example. It largely works, but I hit this issue. From what I've read, it arises when a sentence contains too many tokens. I manually inspected the input to find the offending piece: it's an extra-long "sentence" strung together with many semicolons. I could split sentences on semicolons, but that doesn't seem quite right. Using word_tokenize from NLTK, the sentence has only 488 tokens. How do you tokenize the words? I'm thinking of truncating the sentence before passing it to the model, but to do that accurately, I need to know how many tokens the model will create.

Is my assessment of why this is happening correct, and do you have a better solution than truncating? Thanks.
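
For what it's worth, the 512 limit is counted in BERT's WordPiece subword tokens (plus the `[CLS]` and `[SEP]` specials), not in NLTK word tokens, so 488 words can easily become 538 subwords. A minimal sketch of counting and truncating with the Hugging Face tokenizer — the checkpoint name is an assumption, substitute whichever FinBERT weights you actually load:

```python
# Sketch: count the model's subword tokens and truncate to the 512-token limit.
# "yiyanghkust/finbert-tone" is an assumed checkpoint; swap in the one you use.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("yiyanghkust/finbert-tone")

# Stand-in for one of the over-long semicolon-joined "sentences".
sentence = "Revenue grew; margins expanded; guidance was raised; " * 40

# WordPiece splits rare words into multiple pieces, so this count is usually
# higher than what nltk.word_tokenize reports for the same text.
n_subwords = len(tokenizer.tokenize(sentence))

# Let the tokenizer truncate for you; max_length includes [CLS] and [SEP].
enc = tokenizer(sentence, truncation=True, max_length=512)
print(n_subwords, len(enc["input_ids"]))
```

Passing `truncation=True, max_length=512` when encoding avoids having to guess the subword count yourself, at the cost of silently dropping everything past the limit.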