On our system we're seeing that any call to num_tokens_from_messages takes around 300-500ms consistently, regardless of the size of the message, which is crazily slow for what it's doing. A flamegraph showed the model being loaded on every call, so this change switches to the singleton model instances to see whether that improves performance.
I'm a bit sceptical of this, as the mutex locking might cause its own issues, but we'll see in our testing and maybe come up with a better solution going forwards...
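For reference, the shape of the change is roughly the following. This is a minimal sketch using std's `OnceLock`, not the actual tiktoken-rs types; `Model`, `load_model`, and `num_tokens` here are placeholders for the real encoder and its entry points:

```rust
use std::collections::HashMap;
use std::sync::{Mutex, OnceLock};

// Hypothetical stand-in for the tokenizer model; in the real library the
// expensive part is parsing the BPE vocab and building its lookup maps.
struct Model {
    vocab: HashMap<String, u32>,
}

fn load_model() -> Model {
    let mut vocab = HashMap::new();
    vocab.insert("hello".to_string(), 0);
    vocab.insert("world".to_string(), 1);
    Model { vocab }
}

// Process-wide singleton: load_model() runs exactly once, on first use;
// every later call reuses the same instance behind the mutex instead of
// paying the load cost again.
fn model_singleton() -> &'static Mutex<Model> {
    static MODEL: OnceLock<Mutex<Model>> = OnceLock::new();
    MODEL.get_or_init(|| Mutex::new(load_model()))
}

// Toy token count: the lock is held only for the lookup, never the load.
fn num_tokens(text: &str) -> usize {
    let model = model_singleton().lock().unwrap();
    text.split_whitespace()
        .filter(|w| model.vocab.contains_key(*w))
        .count()
}
```

The mutex concern above is that every caller now serialises on that one lock; if encoding only needs shared access, an `RwLock` or handing each thread its own clone of the encoder would sidestep that.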
As an aside: given the vocab is known up front for a tokeniser, you should be able to just codegen a hashmap if you wanted, since hashmap inserts completely dominate num_tokens_from_messages.
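A rough sketch of that codegen idea, with a made-up two-entry vocab: emit the table as a `match` (or as a compile-time perfect hash, e.g. via the `phf` crate) so nothing is inserted into a map at runtime at all:

```rust
// Hypothetical generated code: because the vocab is fixed at build time,
// token lookup compiles down to a match instead of runtime HashMap inserts.
fn token_id(piece: &str) -> Option<u32> {
    match piece {
        "hello" => Some(0),
        "world" => Some(1),
        _ => None,
    }
}
```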
Experiment in service of #81