whaleloops / KEPT

auto icd coding with prompt
MIT License

loss function labels batch size is not matching #9

Closed sribtc closed 1 month ago

sribtc commented 3 months ago

Hello, I have been running your code on the ICD-10 MIMIC-III dataset and found that my tokenized input reaches 85k+ tokens. That length became my global attention window, so the global window size is greater than 501, at 85k+.

When I run the code for prediction, the loss computation fails because the label batch size is 8921 while the prediction batch size is 810, giving this error: ValueError: Expected input batch_size (810) to match target batch_size (8921).
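
For illustration only (this is not code from the repo), the error above is the shape mismatch PyTorch's cross-entropy loss raises when the prediction and label tensors have different batch dimensions:

```python
# Illustrative reproduction of the reported error, not code from this repo:
# the loss receives 810 prediction rows but 8921 label entries.
import torch
import torch.nn as nn

logits = torch.randn(810, 2)            # model outputs (batch of 810)
labels = torch.randint(0, 2, (8921,))   # labels with a different batch dimension
nn.CrossEntropyLoss()(logits, labels)
# -> ValueError: Expected input batch_size (810) to match target batch_size (8921).
```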

whaleloops commented 2 months ago

Sorry for the late response.

I don't think the model works for 85k tokens, as the model is limited to "max_position_embeddings": 16386, as specified in the model config file.

The best practice is to use Mamba: https://github.com/whaleloops/ClinicalMamba
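
As a quick check, you can confirm the positional limit of whichever checkpoint you are loading by inspecting its config. A minimal sketch is below; the checkpoint path is a placeholder, not the actual KEPT weights:

```python
# Minimal sketch: check the model's positional limit before tokenizing.
# CHECKPOINT is a placeholder for whichever weights you are loading.
from transformers import AutoConfig

CHECKPOINT = "path/to/your/checkpoint"  # placeholder

config = AutoConfig.from_pretrained(CHECKPOINT)
# For the config mentioned above, this prints 16386.
print(config.max_position_embeddings)
```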

sribtc commented 2 months ago

Hey, sorry, I actually ran it on ICD-9 codes only but mistakenly said ICD-10. Even on ICD-9, though, my tokenization produces 85K+ tokens on the MIMIC-III full dataset. Do you know why tokenization is producing that many tokens, and did you face the same problem when running it on the MIMIC-III full dataset with the rerank300 branch code? Also, even with 16386 max positional embeddings, the input still does not match the 8921 labels.

whaleloops commented 1 month ago

The average number of tokens per discharge summary should be about 2k-4k. There may be an uncommon case whose length reaches 85K+; we then truncate it with the tokenizer function.
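
A minimal sketch of capping such an outlier at tokenization time (the checkpoint path and max_length below are illustrative placeholders, not the repo's actual settings):

```python
# Minimal sketch: truncate an unusually long discharge summary when tokenizing.
# CHECKPOINT and max_length are illustrative placeholders.
from transformers import AutoTokenizer

CHECKPOINT = "path/to/your/checkpoint"  # placeholder
note_text = "..."                       # one discharge summary, typically 2k-4k tokens

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
enc = tokenizer(
    note_text,
    truncation=True,      # drop tokens beyond max_length
    max_length=8192,      # illustrative cap below max_position_embeddings (16386)
    return_tensors="pt",
)
print(enc["input_ids"].shape)
```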