yethee / tiktoken-php

This is a port of the tiktoken
MIT License

Large memory usage? #21

Open Vimiso opened 4 weeks ago

Vimiso commented 4 weeks ago

Take the following test:

$usage = memory()[1];

$provider = new \Yethee\Tiktoken\EncoderProvider;
$provider->setVocabCache(storage_path('app'));
$encoder = $provider->getForModel('gpt-4o-mini');

dd(memory()[1]-$usage); // 26mb! 

26 MB seems like a lot, no? Especially considering the cached vocab file is only 3.6 MB.

yethee commented 1 week ago

The token dictionary accounts for most of the allocated memory. The entire dictionary has to stay in memory so that encoding text into tokens, and decoding tokens back into text, is efficient. Currently the built-in array type is used for this, and I don't know of a way to reduce the memory it consumes.
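To illustrate why plain arrays get expensive here, below is a minimal sketch that builds a synthetic vocabulary and measures the memory held by a forward map (token to rank) and a reverse map (rank to token). It assumes the dictionary is kept as two such arrays; the function name `measureSyntheticVocab` and the made-up `tokN` tokens are hypothetical and not part of the library. Exact numbers depend on the PHP version and on token lengths.

```php
<?php

declare(strict_types=1);

/**
 * Builds a synthetic vocabulary and reports how much memory the two lookup
 * arrays hold. The tokens are made up; only the array shapes matter.
 *
 * @return array{tokens: int, bytes: int, bytesPerToken: float}
 */
function measureSyntheticVocab(int $tokenCount): array
{
    $before = memory_get_usage();

    $tokenToRank = []; // string => int, the lookup needed for encoding
    $rankToToken = []; // int => string, the lookup needed for decoding

    for ($rank = 0; $rank < $tokenCount; $rank++) {
        // Hypothetical token bytes; a real vocab file stores base64-encoded tokens.
        $token = 'tok' . $rank;
        $tokenToRank[$token] = $rank;
        $rankToToken[$rank] = $token;
    }

    // Both arrays are still alive here, so the difference includes them.
    $bytes = memory_get_usage() - $before;

    return [
        'tokens' => count($tokenToRank),
        'bytes' => $bytes,
        'bytesPerToken' => $bytes / $tokenCount,
    ];
}

$result = measureSyntheticVocab(200_000);

printf(
    "%d tokens, %.1f MB, %.0f bytes per token\n",
    $result['tokens'],
    $result['bytes'] / 1048576,
    $result['bytesPerToken']
);
```

Each entry pays for a hash table slot and a string header in addition to the token bytes themselves, which is why the in-memory size can end up several times larger than the file on disk.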

Profile

```php
get('');
```

Top of memory usage: [Vocab::fromStream()](https://github.com/yethee/tiktoken-php/blob/16fa1045ce83375db88d74d503bbc72ff0d86c9e/src/Vocab/Vocab.php#L86-L104)

### Encoding: cl100k_base

```
*** SPX Report ***

Global stats:

  Called functions    : 81
  Distinct functions  : 50

  Wall time           : 161.9ms
  ZE memory usage     : 11.8MB

Flat profile:

 Wall time           | ZE memory usage     |
 Inc.     | *Exc.    | Inc.     | Exc.     | Called   | Function
----------+----------+----------+----------+----------+----------
 70.2ms   | 59.0ms   | 432.2KB  | 418.5KB  | 12       | {closure}
 42.1ms   | 38.2ms   | 10.8MB   | 8.8MB    | 1        | Yethee\Tiktoken\Vocab\Vocab::fromStream
 78.5ms   | 5.9ms    | 839.8KB  | 363.7KB  | 1        | ComposerAutoloaderInitac9bfb1d4166aeecccdb5d5dfb6f6537::getLoader
 5.0ms    | 5.0ms    | 120B     | 120B     | 1        | Yethee\Tiktoken\Vocab\Loader\DefaultVocabLoader::checkHash
 4.0ms    | 4.0ms    | 2.0MB    | 2.0MB    | 1        | Yethee\Tiktoken\Vocab\Vocab::__construct
 2.4ms    | 2.4ms    | 43.0KB   | 43.0KB   | 1        | ComposerAutoloaderInitac9bfb1d4166aeecccdb5d5dfb6f6537::loadClassLoader
 29.9us   | 29.9us   | 0B       | 0B       | 1        | /var/src/tiktoken/vendor/phpunit/phpunit/src/Framework/Assert/Functions.php
 42.1ms   | 19.4us   | 10.8MB   | -8.0KB   | 1        | Yethee\Tiktoken\Vocab\Vocab::fromFile
 15.4us   | 15.4us   | 424B     | 424B     | 1        | Composer\Autoload\ClassLoader::initializeIncludeClosure
 5.7ms    | 11.7us   | 592B     | 0B       | 6        | Composer\Autoload\ClassLoader::findFile
```

### Encoding: o200k_base

```
*** SPX Report ***

Global stats:

  Called functions    : 81
  Distinct functions  : 50

  Wall time           : 202.1ms
  ZE memory usage     : 22.7MB

Flat profile:

 Wall time           | ZE memory usage     |
 Inc.     | *Exc.    | Inc.     | Exc.     | Called   | Function
----------+----------+----------+----------+----------+----------
 84.6ms   | 76.1ms   | 21.8MB   | 17.8MB   | 1        | Yethee\Tiktoken\Vocab\Vocab::fromStream
 16.4ms   | 14.6ms   | 64.9KB   | 65.1KB   | 6        | 1@Composer\Autoload\{closure}
 10.8ms   | 10.8ms   | 120B     | 120B     | 1        | Yethee\Tiktoken\Vocab\Loader\DefaultVocabLoader::checkHash
 8.5ms    | 8.5ms    | 4.0MB    | 4.0MB    | 1        | Yethee\Tiktoken\Vocab\Vocab::__construct
 2.0ms    | 2.0ms    | 43.0KB   | 43.0KB   | 1        | ComposerAutoloaderInitac9bfb1d4166aeecccdb5d5dfb6f6537::loadClassLoader
 31.9us   | 31.9us   | 0B       | 0B       | 1        | /var/src/tiktoken/vendor/phpunit/phpunit/src/Framework/Assert/Functions.php
 84.7ms   | 23.8us   | 21.8MB   | -8.0KB   | 1        | Yethee\Tiktoken\Vocab\Vocab::fromFile
 5.5ms    | 10.6us   | 592B     | 0B       | 6        | Composer\Autoload\ClassLoader::findFile
 6.8us    | 6.8us    | 48B      | 48B      | 1        | Yethee\Tiktoken\EncoderProvider::__construct
 106.4ms  | 6.1us    | 21.8MB   | 432B     | 1        | Yethee\Tiktoken\EncoderProvider::getVocab
```
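As a rough reading of the two reports, the inclusive memory of `Vocab::fromStream` can be turned into an approximate per-token cost. The sketch below assumes round token counts of about 100,000 for cl100k_base and 200,000 for o200k_base (inferred from the encoding names, not taken from the reports) and treats 1 MB as 1,048,576 bytes; `bytesPerToken` is a hypothetical helper.

```php
<?php

declare(strict_types=1);

/**
 * Converts a memory figure in MB into an approximate per-token cost in bytes.
 * Assumes 1 MB = 1,048,576 bytes.
 */
function bytesPerToken(float $megabytes, int $tokenCount): float
{
    return $megabytes * 1048576 / $tokenCount;
}

// Inclusive ZE memory of Vocab::fromStream from the reports above.
// The token counts are assumed round figures, not measured values.
$profiles = [
    'cl100k_base' => ['megabytes' => 10.8, 'tokens' => 100_000],
    'o200k_base'  => ['megabytes' => 21.8, 'tokens' => 200_000],
];

foreach ($profiles as $encoding => $profile) {
    printf(
        "%s: about %.0f bytes per token\n",
        $encoding,
        bytesPerToken($profile['megabytes'], $profile['tokens'])
    );
}
```

Under these assumptions both encodings come out at a little over 110 bytes per token, so memory grows roughly linearly with vocabulary size rather than being dominated by a fixed cost.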