zurawiki / tiktoken-rs

Ready-made tokenizer library for working with GPT and tiktoken
MIT License

Is there a way to cache the bpe when using `num_tokens_from_messages` #81

Open xd009642 opened 1 month ago

xd009642 commented 1 month ago

We've found that the `num_tokens_from_messages` function is a significant bottleneck in our application. From some benchmarking it appears to load the BPE every time it's called, and that load dominates the runtime. Is there a way to cache the BPE so we can load it once and not pay the loading cost on every call, or would that require a change to tiktoken-rs' internals (and if so, what would the change have to be)?
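
For context, the kind of caching we have in mind looks roughly like the sketch below. It assumes the `cl100k_base` encoding, the helper names are ours, and it only covers tokenising raw text directly rather than the per-message bookkeeping `num_tokens_from_messages` does:

```rust
use std::sync::OnceLock;
use tiktoken_rs::CoreBPE;

// Load the BPE once and hand out a shared reference afterwards. Assumes the
// cl100k_base encoding; get_bpe_from_model(...) could be used for other models.
fn cached_bpe() -> &'static CoreBPE {
    static BPE: OnceLock<CoreBPE> = OnceLock::new();
    BPE.get_or_init(|| tiktoken_rs::cl100k_base().expect("failed to load cl100k_base"))
}

// Token counting on raw text against the cached encoder; note this does not
// reproduce the per-message overhead that num_tokens_from_messages accounts for.
fn count_tokens(text: &str) -> usize {
    cached_bpe().encode_with_special_tokens(text).len()
}
```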

Section of a flamegraph showing the breakdown of `num_tokens_from_messages`:

[flamegraph image]

xd009642 commented 1 month ago

As an aside on performance, having a mutex around the singleton models seems counterproductive. Generally, with model code you want the model to be either:

  1. callable from multiple threads/contexts, with any mutable state stored in a separate type so the model type itself stays immutable, or
  2. cheap to clone (usually via reference-counted parameters), so the model is still reusable when mutable state can't be separated out.

Wrapping the singletons in mutexes drastically limits their utility, given that the tokenisers run on the CPU and aren't that complex.
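
As a sketch of option 1: assuming `CoreBPE` is `Send + Sync` and its encode methods take `&self`, a plain `Arc` is enough to share one loaded tokeniser across threads with no mutex (the example is illustrative, not the crate's current API):

```rust
use std::sync::Arc;
use std::thread;
use tiktoken_rs::CoreBPE;

fn main() {
    // Load once, then share immutably. Assuming CoreBPE is Send + Sync and its
    // encode methods take &self, concurrent tokenisation needs no Mutex at all.
    let bpe: Arc<CoreBPE> = Arc::new(tiktoken_rs::cl100k_base().expect("failed to load BPE"));

    let handles: Vec<_> = (0..4)
        .map(|i| {
            let bpe = Arc::clone(&bpe);
            thread::spawn(move || bpe.encode_with_special_tokens(&format!("message {i}")).len())
        })
        .collect();

    for handle in handles {
        println!("tokens: {}", handle.join().unwrap());
    }
}
```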

Also, for benchmarking you should look at criterion or divan: the built-in Rust benchmark harness isn't stable and provides less useful statistics/measurements.
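
For example, a minimal divan setup looks something like this; the file layout and the workload inside the bench are just assumptions to show the harness:

```rust
// benches/token_counting.rs
// Assumes `divan` as a dev-dependency and a [[bench]] entry with `harness = false`.
fn main() {
    divan::main();
}

#[divan::bench]
fn count_tokens() -> usize {
    // Illustrative workload only: load the encoding for one model and count
    // tokens for a fixed prompt. Returning the count keeps the work from
    // being optimised away.
    let bpe = tiktoken_rs::get_bpe_from_model("gpt-3.5-turbo").expect("unknown model");
    bpe.encode_with_special_tokens("You are a helpful assistant.").len()
}
```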

xd009642 commented 1 month ago

To back this up with some benchmarking results: we took the implementation of `num_tokens_from_messages` and passed in an already-loaded model to avoid reloading it (a rough sketch of that variant is at the end of this comment), and got:

                        fastest       │ slowest       │ median        │ mean          │ samples │ iters
├─ token_counting                     │               │               │               │         │
│  ├─ Azure35           2.299 ms      │ 97.9 ms       │ 4.02 ms       │ 5.278 ms      │ 100     │ 100
│  ├─ Azure4            2.048 ms      │ 11.42 ms      │ 4.429 ms      │ 4.66 ms       │ 100     │ 100
│  ├─ Llama2            2.373 ms      │ 33.58 ms      │ 5.256 ms      │ 6.082 ms      │ 100     │ 100
│  ├─ OpenAI35          3.006 ms      │ 15.72 ms      │ 8.08 ms       │ 8.418 ms      │ 100     │ 100
│  ╰─ OpenAI4           3.22 ms       │ 30 ms         │ 7.46 ms       │ 8.057 ms      │ 100     │ 100

Using the version in this library we get:

Timer precision: 25 ns
assistant_benches       fastest       │ slowest       │ median        │ mean          │ samples │ iters
├─ token_counting                     │               │               │               │         │
│  ├─ Azure35           82.79 ms      │ 174.3 ms      │ 92.74 ms      │ 97.51 ms      │ 100     │ 100
│  ├─ Azure4            208.7 ms      │ 694.2 ms      │ 238.2 ms      │ 274.8 ms      │ 100     │ 100
│  ├─ Llama2            83.74 ms      │ 123.6 ms      │ 99.9 ms       │ 99.07 ms      │ 100     │ 100
│  ├─ OpenAI35          83.85 ms      │ 125.5 ms      │ 97.92 ms      │ 98.48 ms      │ 100     │ 100
│  ╰─ OpenAI4           208.9 ms      │ 528.9 ms      │ 241.8 ms      │ 249.7 ms      │ 100     │ 100
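
For reference, the shape of the change is roughly the sketch below; the helper name, the stand-in message type, and the per-message constants are illustrative rather than the crate's actual code:

```rust
use tiktoken_rs::CoreBPE;

/// Minimal stand-in for a chat message; the real code uses the crate's own
/// message type, this struct is just for illustration.
struct Msg {
    role: String,
    content: String,
}

/// Hypothetical variant that takes a preloaded CoreBPE instead of resolving
/// and loading one from the model name on every call. The per-message
/// constants are assumptions, not the crate's actual values.
fn num_tokens_with_preloaded_bpe(bpe: &CoreBPE, messages: &[Msg]) -> usize {
    let mut num_tokens = 0;
    for m in messages {
        num_tokens += 4; // assumed per-message overhead
        num_tokens += bpe.encode_with_special_tokens(&m.role).len();
        num_tokens += bpe.encode_with_special_tokens(&m.content).len();
    }
    num_tokens + 3 // assumed reply-priming tokens
}
```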