Performance - GitHub Issues

mrahmadt commented 10 months ago

Hello

I noticed that the code below is much slower than https://platform.openai.com/tokenizer:

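    // Assumes `use Yethee\Tiktoken\EncoderProvider;` at the top of the file.
    protected static $encoding_name = 'cl100k_base';

    public function countTokens($content){
        // A new provider is created on every call, and no vocab cache directory is set.
        $provider = new EncoderProvider();
        $encoder = $provider->get(self::$encoding_name);
        $tokens = $encoder->encode($content);
        return count($tokens);
    }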

The code above takes 3 seconds to return the token count for the text in MSGSphere.txt, while https://platform.openai.com/tokenizer takes less than a second for the same text.

Is there any way to optimize the speed?
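
One way to put a number on the slowdown is to time the call directly. The following is a minimal sketch, not part of the library: `timeCall` is a hypothetical helper, and the workload shown is a stand-in so the snippet runs on its own. In a real check the closure would call `countTokens()` with the contents of the text file.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of tiktoken-php): runs $fn once and
 * returns [result, elapsed seconds].
 */
function timeCall(callable $fn): array
{
    $start = hrtime(true);                      // monotonic clock, in nanoseconds
    $result = $fn();
    $elapsed = (hrtime(true) - $start) / 1e9;   // convert nanoseconds to seconds

    return [$result, $elapsed];
}

// Stand-in workload: counts space-separated words instead of real tokens.
// Replace the closure body with e.g. $this->countTokens($text) to time the real code.
[$count, $seconds] = timeCall(fn (): int => count(explode(' ', 'one two three')));

printf("%d items in %.4f s\n", $count, $seconds);
```

Running the same measurement before and after a change makes it easy to see whether the change actually helped.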

mrahmadt commented 10 months ago

Hello all,

For anyone looking for a solution: the problem was that we didn't set a caching directory. @yethee, could you kindly mention this in the README file?

    public function countTokens($content){
        $provider = new EncoderProvider();
        // Cache the vocabulary on disk; storage_path() is Laravel's helper,
        // and any writable directory should work here.
        $provider->setVocabCache(storage_path('encoders'));
        $encoder = $provider->get(self::$encoding_name);
        $tokens = $encoder->encode($content);
        return count($tokens);
    }
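
If `countTokens()` is called many times in one request, it may also help to build the encoder once and reuse it, since the snippet above creates a new provider on every call. The sketch below illustrates that idea under stated assumptions: `TokenCounter` is a hypothetical class, not part of tiktoken-php, and it accepts any factory that returns an object with an `encode(string): array` method. The whitespace-splitting stand-in encoder is there only so the example runs without the library.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of tiktoken-php): creates the encoder lazily
 * on the first call and reuses the same instance afterwards.
 */
final class TokenCounter
{
    /** @var callable Factory returning an object with encode(string): array */
    private $factory;

    private ?object $encoder = null;

    public function __construct(callable $factory)
    {
        $this->factory = $factory;
    }

    public function count(string $content): int
    {
        // The factory runs only when no encoder has been built yet.
        $this->encoder ??= ($this->factory)();

        return count($this->encoder->encode($content));
    }
}

// Stand-in encoder so the sketch is self-contained; it splits on whitespace.
// With the library, the factory would instead create an EncoderProvider,
// call setVocabCache(...), and return $provider->get('cl100k_base').
$factoryCalls = 0;
$counter = new TokenCounter(function () use (&$factoryCalls): object {
    $factoryCalls++;

    return new class {
        public function encode(string $text): array
        {
            return preg_split('/\s+/', trim($text));
        }
    };
});

$first = $counter->count('a b c');
$second = $counter->count('d e');
```

Here the factory is invoked once even though `count()` is called twice. Whether this gives a measurable gain on top of the vocab cache depends on the workload and is worth timing.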

yethee commented 10 months ago

Thanks for the feedback!

I've updated the README. Also, since 0.2.0 the cache is enabled by default.

yethee / tiktoken-php

Performance #5