xd009642 opened this issue 1 month ago
As an aside on performance, wrapping the singleton models in a mutex seems counterproductive. Generally, with model code you want the model to be shareable across threads without serialising access. Having mutexes around the singletons drastically limits their utility, given that the tokenisers run on the CPU and aren't that complex.
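To make that concrete, this is the kind of mutex-free singleton we have in mind. It's only a sketch: it assumes CoreBPE is Send + Sync and that its encode methods take &self, and cl100k_base_shared is our name for it, not an existing tiktoken-rs function.

```rust
use std::sync::OnceLock;

use tiktoken_rs::{cl100k_base, CoreBPE};

// Sketch of a mutex-free singleton: build the BPE once and hand out plain
// shared references. Assumes CoreBPE is Send + Sync and that its encode
// methods take &self, so no lock is needed for concurrent tokenisation.
fn cl100k_base_shared() -> &'static CoreBPE {
    static BPE: OnceLock<CoreBPE> = OnceLock::new();
    BPE.get_or_init(|| cl100k_base().expect("failed to build cl100k_base"))
}
```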
Also, for benchmarking you should look at criterion or divan; the built-in Rust benchmark harness isn't stable and provides less useful statistics/measurements.
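For illustration, a minimal divan skeleton looks roughly like the following. The real benchmarks call our own token-counting wrapper; here the bench body just times the BPE construction, which is the part that dominates for us.

```rust
// benches/token_counting.rs -- needs `harness = false` on the bench target
// in Cargo.toml so divan's own harness runs.
fn main() {
    divan::main();
}

// Times only the BPE construction; divan black-boxes the returned value.
#[divan::bench]
fn load_cl100k_base() -> tiktoken_rs::CoreBPE {
    tiktoken_rs::cl100k_base().unwrap()
}
```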
Some benchmarking results to back this up: we took the implementation of num_tokens_from_messages, passed in an already loaded model to avoid reloading it on every call, and got:
assistant_benches fastest │ slowest │ median │ mean │ samples │ iters
├─ token_counting │ │ │ │ │
│ ├─ Azure35 2.299 ms │ 97.9 ms │ 4.02 ms │ 5.278 ms │ 100 │ 100
│ ├─ Azure4 2.048 ms │ 11.42 ms │ 4.429 ms │ 4.66 ms │ 100 │ 100
│ ├─ Llama2 2.373 ms │ 33.58 ms │ 5.256 ms │ 6.082 ms │ 100 │ 100
│ ├─ OpenAI35 3.006 ms │ 15.72 ms │ 8.08 ms │ 8.418 ms │ 100 │ 100
│ ╰─ OpenAI4 3.22 ms │ 30 ms │ 7.46 ms │ 8.057 ms │ 100 │ 100
Using the version in this library we get:
Timer precision: 25 ns
assistant_benches fastest │ slowest │ median │ mean │ samples │ iters
├─ token_counting │ │ │ │ │
│ ├─ Azure35 82.79 ms │ 174.3 ms │ 92.74 ms │ 97.51 ms │ 100 │ 100
│ ├─ Azure4 208.7 ms │ 694.2 ms │ 238.2 ms │ 274.8 ms │ 100 │ 100
│ ├─ Llama2 83.74 ms │ 123.6 ms │ 99.9 ms │ 99.07 ms │ 100 │ 100
│ ├─ OpenAI35 83.85 ms │ 125.5 ms │ 97.92 ms │ 98.48 ms │ 100 │ 100
│ ╰─ OpenAI4 208.9 ms │ 528.9 ms │ 241.8 ms │ 249.7 ms │ 100 │ 100
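For reference, the "already loaded model" variant behind the first table looked roughly like this. It's a simplified sketch: the messages are plain (role, content) pairs rather than the crate's request-message type, and the per-message constants are the usual gpt-3.5/gpt-4 accounting.

```rust
use tiktoken_rs::CoreBPE;

// Same counting logic, but the caller owns the CoreBPE and passes it in,
// so nothing is rebuilt per call. Message shape and constants are simplified.
fn num_tokens_from_messages_cached(bpe: &CoreBPE, messages: &[(&str, &str)]) -> usize {
    let tokens_per_message = 3; // per-message overhead for gpt-3.5/gpt-4 style models
    let mut num_tokens = 3; // every reply is primed with <|start|>assistant<|message|>
    for &(role, content) in messages {
        num_tokens += tokens_per_message;
        num_tokens += bpe.encode_with_special_tokens(role).len();
        num_tokens += bpe.encode_with_special_tokens(content).len();
    }
    num_tokens
}
```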
We've found that the num_tokens_from_messages function is a significant bottleneck in our application. From some benchmarking it seems to load the BPE every time it's called, and this dominates the runtime. Is there a way to cache this so we can load it once and not take the pain of loading it each time, or will it require a change to tiktoken-rs' internals (and if so, what would the change have to be)?

[Section of a flamegraph showing the breakdown of num_tokens_from_messages]
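A quick way to reproduce the effect without a benchmark harness is to time the BPE construction on its own (sketch below, using cl100k_base as an example model):

```rust
use std::time::Instant;

use tiktoken_rs::cl100k_base;

// Time the BPE construction separately from the encoding it's used for.
// In our profiles the construction is where almost all of the time goes.
fn main() {
    let start = Instant::now();
    let bpe = cl100k_base().unwrap();
    println!("built cl100k_base in {:?}", start.elapsed());

    let start = Instant::now();
    let tokens = bpe.encode_with_special_tokens("hello world");
    println!("encoded {} tokens in {:?}", tokens.len(), start.elapsed());
}
```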