venual opened this issue 11 months ago
Interesting find! Do you have a code snippet we can reproduce this issue with?
Hi @venual, I'm following up to see if this is still an issue.
We also see this in production and are investigating, @zurawiki.
In any case, there should definitely be a way to re-use the BPE.
We did some investigation: the lib does a lot of allocations, especially around regexes (fancy_regex) in _encode_native. Given a 1–2 MB input, RAM usage can easily bubble up to 50 MB. Just looking at the code I see a couple of quick wins (for example, the transformations of messages from OpenAI types to your types clone the data instead of taking a reference). It would certainly be better to use a single BPE instance too, but I don't think that is the primary memory problem. We used the lib to estimate the cost of user-provided requests before sending them to OpenAI.
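For the "single BPE instance" point, one standard pattern is to build the tokenizer once per process and hand out shared references afterwards. This is a minimal stdlib-only sketch of that pattern; `Bpe` and `load_bpe` are stand-ins for the real (expensive) tiktoken-rs constructors, not the library's API:

```rust
use std::sync::OnceLock;

// Hypothetical stand-in for an expensive-to-build tokenizer; the real
// tiktoken-rs constructors allocate similarly large rank tables.
struct Bpe {
    ranks: Vec<u64>,
}

fn load_bpe() -> Bpe {
    // Simulate the expensive one-time construction.
    Bpe { ranks: (0..100_000).collect() }
}

// Build the BPE once and return a shared reference on every call,
// instead of reconstructing it per request.
fn shared_bpe() -> &'static Bpe {
    static BPE: OnceLock<Bpe> = OnceLock::new();
    BPE.get_or_init(load_bpe)
}

fn main() {
    let a = shared_bpe();
    let b = shared_bpe();
    // Both calls return the same instance; no repeated allocation.
    assert!(std::ptr::eq(a, b));
    println!("ranks loaded: {}", a.ranks.len());
}
```

The same idea works with `lazy_static`/`once_cell` on older toolchains; `OnceLock` is in std since Rust 1.70.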
Thanks for the analysis @Sytten.
Is there a specific code snippet we can use to build a regression test to make sure memory stays under a reasonable amount? How are you testing memory usage?
For follow-ups, it looks like we should:

- investigate the allocations in _encode_native
We did mainly manual tests with a codepath that hit tiktoken::num_tokens_from_messages.
I used dhat (https://docs.rs/dhat/latest/dhat/), and you can build automated memory tests with it.
What we have seen is that it becomes "exponential" the more data you feed it, so you will easily see the RAM usage spike in an activity monitor when you feed it 300 kB. Even if in theory you should not pass the OpenAI service that much data, I would still use that as a benchmark to diff the memory usage.
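A dhat harness along those lines can look like the sketch below. This is illustrative only: it assumes `dhat = "0.3"` in Cargo.toml behind a `dhat-heap` cargo feature, and `tokenize` is a placeholder for the real tiktoken-rs call, not the library's API:

```rust
// Sketch of a heap-profiling harness using dhat. Without the `dhat-heap`
// feature this compiles with the stdlib alone.
#[cfg(feature = "dhat-heap")]
#[global_allocator]
static ALLOC: dhat::Alloc = dhat::Alloc;

// Placeholder so the sketch compiles stand-alone without tiktoken-rs.
fn tokenize(s: &str) -> usize {
    s.split_whitespace().count()
}

fn main() {
    // dhat writes dhat-heap.json when this guard drops; load that file in
    // the DHAT viewer to see where the allocations come from.
    #[cfg(feature = "dhat-heap")]
    let _profiler = dhat::Profiler::new_heap();

    // ~300 kB input, the size at which the memory spike is easy to observe.
    let input = "lorem ipsum ".repeat(25_000);
    println!("{} tokens", tokenize(&input));
}
```

dhat also has a testing mode (`dhat::Profiler::builder().testing().build()` plus `dhat::HeapStats::get()`) that can assert on peak heap bytes, which is one way to turn this into an automated regression test.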
So from an initial analysis (see the dhat-heap file below), I can confirm time is spent in _encode_native, specifically around matching the fancy_regex.
There are performance notes observing that the regex takes a fair bit of CPU. I noticed that the regex used in the original tiktoken library uses the negative lookahead operator (?! , which can create performance problems.
https://github.com/zurawiki/tiktoken-rs/blob/4afb9d35afa4743a3a095da1bfd61ee31a71f38d/tiktoken-rs/src/tiktoken_ext/openai_public.rs#L116
We could try to switch to the regex Rust crate, since we don't necessarily have the threading issues that the Python interop has, but we would need a regex without look-around.
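To make the constraint concrete: one look-around use in the tiktoken pattern is `\s+(?!\S)` (a whitespace run whose final character, when a non-space follows, is left for the next token), and a look-around-free port has to reproduce that split in code. This stdlib-only sketch shows the semantics; the function name and logic are illustrative, not taken from tiktoken-rs:

```rust
/// Match a leading whitespace run the way `\s+(?!\S)` does:
/// - if the run reaches end of input, match the whole run;
/// - if a non-space follows, back off one whitespace char (the regex engine
///   backtracks to satisfy the lookahead), failing if only one char remains.
fn match_ws_not_before_nonspace(s: &str) -> Option<&str> {
    // End of the leading whitespace run (byte offset).
    let run_end = s
        .char_indices()
        .find(|(_, c)| !c.is_whitespace())
        .map(|(i, _)| i)
        .unwrap_or(s.len());
    if run_end == 0 {
        return None; // no whitespace at the start, no match
    }
    if run_end == s.len() {
        return Some(s); // run reaches end of input: `(?!\S)` holds
    }
    // A non-space follows: hold back the run's last whitespace char.
    let last_ws_start = s[..run_end]
        .char_indices()
        .last()
        .map(|(i, _)| i)
        .unwrap(); // run is non-empty, so this exists
    if last_ws_start == 0 {
        None // single whitespace char before a non-space: lookahead fails
    } else {
        Some(&s[..last_ws_start])
    }
}

fn main() {
    assert_eq!(match_ws_not_before_nonspace("   x"), Some("  "));
    assert_eq!(match_ws_not_before_nonspace(" x"), None);
    assert_eq!(match_ws_not_before_nonspace("   "), Some("   "));
    assert_eq!(match_ws_not_before_nonspace("x"), None);
}
```

Any switch to the look-around-free regex crate would need this kind of post-processing (or a reworked alternation) to keep tokenization byte-for-byte identical.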
Note that in case you want to profile memory tests, I pushed a commit with an example that can be run with:

cargo run --example num_tokens_memory --features dhat-heap --release
I don't know if I'm using it wrong, but creating a new BPE allocates around 20 MB of memory that is never released. On top of that, the async_openai::get_max_tokens_chat_message function creates a new BPE internally, so every call adds significant memory usage that is never released.