Open flexchar opened 11 months ago
Hi!
Thanks for the report. I can confirm that the PHP implementation is less performant than the original library.
In the tiktoken library, the core logic is written in Rust, so I don't think you can get comparable performance from pure PHP.
$ phpbench run src/EncodeBench.php --report=aggregate --php-config='{"zend.assertions":-1}'
PHPBench (1.2.14) running benchmarks...
with configuration file: /var/bench/phpbench.json
with PHP version 8.1.24, xdebug ❌, opcache ❌
\Benchmark\EncodeBench
benchPHPImplementation..................I4 - Mo22.591ms (±1.66%)
benchRPCCounter.........................I4 - Mo4.339ms (±2.95%)
Subjects: 2, Assertions: 0, Failures: 0, Errors: 0
+-------------+------------------------+-----+------+-----+----------+----------+--------+
| benchmark | subject | set | revs | its | mem_peak | mode | rstdev |
+-------------+------------------------+-----+------+-----+----------+----------+--------+
| EncodeBench | benchPHPImplementation | | 100 | 5 | 18.606mb | 22.591ms | ±1.66% |
| EncodeBench | benchRPCCounter | | 100 | 5 | 18.606mb | 4.339ms | ±2.95% |
+-------------+------------------------+-----+------+-----+----------+----------+--------+
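For context, a PHPBench subject producing output like the above could look roughly like this. This is a sketch, not the repository's actual benchmark: the sample text and the tokenizer API (`EncoderProvider::get`) are assumptions.

```php
<?php

namespace Benchmark;

use PhpBench\Attributes as Bench;

class EncodeBench
{
    private string $text;

    public function __construct()
    {
        // Sample input; the real benchmark text is not shown in the thread.
        $this->text = str_repeat('The quick brown fox jumps over the lazy dog. ', 100);
    }

    #[Bench\Revs(100)]
    #[Bench\Iterations(5)]
    public function benchPHPImplementation(): void
    {
        // Hypothetical call into the pure-PHP tokenizer under test (assumed API).
        $provider = new \Yethee\Tiktoken\EncoderProvider();
        $provider->get('cl100k_base')->encode($this->text);
    }
}
```

The `Revs(100)` and `Iterations(5)` attributes match the `revs`/`its` columns in the table above.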
We need to further investigate the issue to understand whether optimization is possible.
In that case, something like PHP-FFI with a native implementation in C++ could make for a fairer comparison! I see there are several implementations: https://github.com/sewenew/tokenizer
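As a rough illustration of the FFI route, here is a minimal sketch. The shared-library name (`libtiktoken.so`) and the C function signature (`encode_count`) are hypothetical; a real binding would depend on the native tokenizer actually chosen.

```php
<?php

// Sketch only: assumes a native shared library built from a C/C++ tokenizer.
// The library name and function signature below are hypothetical.
$ffi = FFI::cdef(
    'int encode_count(const char *text);',
    'libtiktoken.so'
);

function countTokens(FFI $ffi, string $text): int
{
    // The heavy BPE work happens in native code; PHP only marshals the string.
    return $ffi->encode_count($text);
}
```

The appeal of this design is that the per-call overhead is a single string copy across the FFI boundary, while the tokenization loop itself runs at native speed.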
I updated the implementation; this sped up converting text into tokens roughly 2× on your example. You can check #10 for details.
That is extraordinary work! I also thought it might be possible to call a C++ tiktoken implementation from PHP via FFI. While I understand the idea at a high level, that is sadly far beyond my skill set. ✌️
Hey, thanks for porting this over! I wanted to move to PHP to remove an extra dependency (a Docker service exposing the Python tiktoken library over an API). I ran a small benchmark, and it seems the PHP version is significantly slower.
Source for Docker service: https://github.com/flexchar/tiktoken-counter
I use Laravel. I wrote a simple command that tokenizes a 100-sentence text 1000 times.
Median output is around:
Running PHP 8.2.10 in Docker on an M2.
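The benchmark command described above could be sketched roughly as follows. This is not the author's actual code: the sample text and the tokenizer call are assumptions, and the timing logic is just one reasonable way to compute a median.

```php
<?php

namespace App\Console\Commands;

use Illuminate\Console\Command;

class BenchmarkTokenizer extends Command
{
    protected $signature = 'benchmark:tokenizer';
    protected $description = 'Tokenize a 100-sentence text 1000 times and report the median';

    public function handle(): int
    {
        // Placeholder input; the original 100-sentence text is not shown in the thread.
        $text = str_repeat('The quick brown fox jumps over the lazy dog. ', 100);

        // Hypothetical tokenizer setup; substitute the library under test.
        $provider = new \Yethee\Tiktoken\EncoderProvider();
        $encoder = $provider->get('cl100k_base');

        $timings = [];
        for ($i = 0; $i < 1000; $i++) {
            $start = hrtime(true);
            $encoder->encode($text);
            $timings[] = (hrtime(true) - $start) / 1e6; // nanoseconds -> ms
        }

        sort($timings);
        // Median of 1000 samples: average of the two middle values.
        $median = ($timings[499] + $timings[500]) / 2;
        $this->info(sprintf('Median: %.3f ms', $median));

        return self::SUCCESS;
    }
}
```

Reporting the median rather than the mean keeps occasional GC pauses or scheduler hiccups from skewing the result.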