zurawiki / tiktoken-rs

Ready-made tokenizer library for working with GPT and tiktoken
MIT License

BPE memory leak #39

Open venual opened 11 months ago

venual commented 11 months ago

I don't know if I'm using it wrong, but creating a new BPE allocates around 20 MB of memory that is never released. On top of that, the async_openai::get_max_tokens_chat_message function creates a new BPE internally, so every call adds memory usage that is never released.

zurawiki commented 11 months ago

Interesting find! Do you have a code snippet we can reproduce this issue with?

zurawiki commented 11 months ago

Hi @venual, I'm following up to see if this is still an issue.

Sytten commented 10 months ago

We also see this in production and are investigating, @zurawiki.

In any case, there should certainly be a way to re-use the BPE.
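
As a stopgap, something like the following builds the tokenizer once and reuses it across calls (a minimal sketch, assuming the cl100k_base constructor; SHARED_BPE and shared_bpe are our names, not part of the tiktoken-rs API):

```rust
use std::sync::OnceLock;
use tiktoken_rs::{cl100k_base, CoreBPE};

// Build the tokenizer once per process and hand out references afterwards,
// so the large construction cost is paid a single time.
fn shared_bpe() -> &'static CoreBPE {
    static SHARED_BPE: OnceLock<CoreBPE> = OnceLock::new();
    SHARED_BPE.get_or_init(|| cl100k_base().expect("failed to build BPE"))
}

fn main() {
    // Repeated calls reuse the same instance instead of rebuilding it.
    for _ in 0..3 {
        let n = shared_bpe()
            .encode_with_special_tokens("estimate me before sending")
            .len();
        println!("{n} tokens");
    }
}
```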

Sytten commented 10 months ago

We did some investigation, and the lib does a lot of allocations, especially around regexes (fancy_regex) in _encode_native. Given a 1-2 MB input, it can easily balloon to 50 MB of RAM usage. Just looking at the code, I see a couple of quick wins (for example, the transformation of messages from OpenAI types to your types uses clones instead of references). It would certainly be better to use a single BPE instance too, but I don't think that is the primary memory problem. We used the lib to estimate the cost of user-provided requests before sending them to OpenAI.
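
To illustrate the clone-vs-reference point, here is a sketch with hypothetical stand-in types (not the library's actual definitions):

```rust
// Hypothetical stand-ins for the message types involved in the conversion;
// the real async-openai / tiktoken-rs types are more elaborate.
struct OpenAiMessage {
    role: String,
    content: String,
}

struct BorrowedMessage<'a> {
    role: &'a str,
    content: &'a str,
}

// Cloning conversion: duplicates every string, so a 1-2 MB prompt is
// allocated twice before tokenization even starts.
fn convert_by_clone(msg: &OpenAiMessage) -> OpenAiMessage {
    OpenAiMessage {
        role: msg.role.clone(),
        content: msg.content.clone(),
    }
}

// Borrowing conversion: zero extra allocation, the tokenizer can read the
// original buffers directly.
fn convert_by_ref(msg: &OpenAiMessage) -> BorrowedMessage<'_> {
    BorrowedMessage {
        role: &msg.role,
        content: &msg.content,
    }
}

fn main() {
    let msg = OpenAiMessage {
        role: "user".to_string(),
        content: "x".repeat(1024 * 1024), // ~1 MB of content
    };
    let cloned = convert_by_clone(&msg); // second ~1 MB allocation
    let borrowed = convert_by_ref(&msg); // no new allocation
    println!(
        "original {} B, cloned {} B ({}), borrowed {} B ({})",
        msg.content.len(),
        cloned.content.len(),
        cloned.role,
        borrowed.content.len(),
        borrowed.role
    );
}
```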

zurawiki commented 10 months ago

Thanks for the analysis @Sytten.

Is there a specific code snippet we can use to build a regression test to make sure memory stays under a reasonable amount? How are you testing memory usage?

For follow-ups, it looks like we should:

- build a memory regression test around a representative large input
- reduce allocations in _encode_native, especially around the fancy_regex matching
- avoid cloning when converting OpenAI message types
- make it easy to re-use a single BPE instance across calls

Sytten commented 10 months ago

We mainly did manual tests with a code path that calls tiktoken::num_tokens_from_messages. I used dhat (https://docs.rs/dhat/latest/dhat/), and you can build automated memory tests with it. What we have seen is that memory usage becomes "exponential" the more data you feed it, so you will easily see the RAM spike in an activity monitor when you feed it 300 KB. Even if in theory you should not pass the OpenAI service that much data, I would still use that as a benchmark to diff the memory usage.
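
For reference, the dhat wiring looks roughly like this (standard setup from the dhat docs, not code from this repo; the tokenizer calls are only illustrative):

```rust
// Standard dhat heap-profiling wiring, per https://docs.rs/dhat — requires
// the `dhat` crate as a dependency.
#[global_allocator]
static ALLOC: dhat::Alloc = dhat::Alloc;

fn main() {
    // Profiling runs while this guard is alive; on drop it writes
    // dhat-heap.json, which can be diffed between runs.
    let _profiler = dhat::Profiler::new_heap();

    // Feed a few hundred KB through the code path under test and compare
    // the resulting profiles between versions.
    let bpe = tiktoken_rs::cl100k_base().expect("failed to build BPE");
    let input = "The quick brown fox. ".repeat(15_000); // ~300 KB
    let tokens = bpe.encode_with_special_tokens(&input);
    println!("{} tokens", tokens.len());
}
```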

zurawiki commented 10 months ago

So from an initial analysis (see the dhat-heap file below), I can confirm that time is spent in _encode_native, specifically around matching the fancy_regex.

dhat-heap.json

There are performance notes indicating that the regex matching takes a fair bit of CPU. I noticed that the regex used in the original tiktoken library uses the negative lookahead operator (?!...), which can create performance problems over time. https://github.com/zurawiki/tiktoken-rs/blob/4afb9d35afa4743a3a095da1bfd61ee31a71f38d/tiktoken-rs/src/tiktoken_ext/openai_public.rs#L116

We could try switching to the regex Rust crate, since we don't necessarily have the threading issues that the Python interop has, but we would need a pattern without look-around.
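
To illustrate the constraint: fancy_regex accepts the look-ahead pattern, while the regex crate rejects it at compile time because it only supports patterns it can match in linear time (pattern shortened for illustration; assumes both crates as dependencies):

```rust
fn main() {
    // fancy_regex implements look-around, so negative look-ahead compiles.
    let fancy = fancy_regex::Regex::new(r"(?!\d)\w+");
    assert!(fancy.is_ok());

    // The regex crate guarantees linear-time matching and rejects
    // look-around at compile time, so porting the tokenizer would mean
    // rewriting its pattern without (?!...).
    let plain = regex::Regex::new(r"(?!\d)\w+");
    assert!(plain.is_err());

    println!("fancy_regex: ok, regex: rejects look-around");
}
```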

Note that if you want to profile memory usage, I pushed a commit with an example that can be run with:

cargo run --example num_tokens_memory --features dhat-heap --release