yethee / tiktoken-php

This is a port of the tiktoken
MIT License
100 stars 22 forks

Is PHP cursed to be much slower? #6

Open flexchar opened 11 months ago

flexchar commented 11 months ago

Hey, thanks for porting this over! I wanted to move to PHP to remove an extra dependency (a Docker service exposing Python's tiktoken over an API). I ran a small benchmark, and the PHP version appears to be considerably slower.

Source for Docker service: https://github.com/flexchar/tiktoken-counter

I use Laravel. I wrote a simple command that tokenizes a 100-paragraph text 1000 times.

Median output is around:

```
Docker time: 4.5049350261688 seconds
PHP time: 20.138854026794 seconds
```

```php
<?php

namespace App\Console\Commands;

use Illuminate\Console\Command;
use Illuminate\Support\Facades\Http;

class BenchmarkTikToken extends Command
{
    /**
     * The name and signature of the console command.
     *
     * @var string
     */
    protected $signature = 'app:benchmark-tik-token';

    /**
     * The console command description.
     *
     * @var string
     */
    protected $description = 'Benchmark PHP version of TikToken vs. Python using Docker image';

    // Store initialized tokenizer
    public \Yethee\Tiktoken\Encoder $encoder;

    /**
     * Execute the console command.
     */
    public function handle(): void
    {
        $this->warn('Make sure to `composer require yethee/tiktoken`.');

        $timesToIterate = 1000;
        $text = Http::get(
            'https://baconipsum.com/api/?type=meat-and-filler&paras=100&format=text',
        )
            ->throw()
            ->body();

        // Warm up the functions
        $provider = app(\Yethee\Tiktoken\EncoderProvider::class);
        $this->encoder = $provider->getForModel('gpt-4');
        $this->countTokens('hello world');
        $this->countTokensPhp('hello world');

        // Benchmark the functions
        $countTokensTime = $this->benchmark(function () use ($text, $timesToIterate) {
            foreach (range(1, $timesToIterate) as $_iteration) {
                $this->countTokens($text);
            }
        });

        $countTokensPhpTime = $this->benchmark(function () use ($text, $timesToIterate) {
            foreach (range(1, $timesToIterate) as $_iteration) {
                $this->countTokensPhp($text);
            }
        });

        // Print the results
        $this->line("Docker time: {$countTokensTime} seconds");
        $this->line("PHP time: {$countTokensPhpTime} seconds");
    }

    private function benchmark(callable $function): float
    {
        $start = microtime(true);
        $function();
        $end = microtime(true);

        return $end - $start;
    }

    public function countTokensPhp(string $text): int
    {
        $tokens = $this->encoder->encode($text);

        return count($tokens);
    }

    public function countTokens(string $text): int
    {
        $tokens = Http::post('tiktoken:8000/count', [
            'text' => $text,
        ])
            ->throw()
            ->json('tokens');

        return (int) ceil($tokens * 1.05);
    }
}
```

Running PHP 8.2.10 on Docker on M2.
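
The "median output" above implies several runs of the command. As a minimal sketch of how that could be automated, the helper below times a callable over a few runs and returns the median. `medianRunTime` and `median` are hypothetical names, not part of Laravel or yethee/tiktoken, and `hrtime()` is used here in place of the `microtime()` call from the command because it reads a monotonic clock.

```php
<?php

declare(strict_types=1);

/**
 * Return the median of a non-empty list of numbers.
 */
function median(array $values): float
{
    if ($values === []) {
        throw new InvalidArgumentException('median() needs at least one value.');
    }

    sort($values);
    $count = count($values);
    $mid = intdiv($count, 2);

    // Odd count: the middle element. Even count: mean of the two middle elements.
    return $count % 2 === 1
        ? (float) $values[$mid]
        : ($values[$mid - 1] + $values[$mid]) / 2;
}

/**
 * Call $fn $iterations times per run, repeat for $runs runs,
 * and return the median wall-clock time of a run in seconds.
 */
function medianRunTime(callable $fn, int $runs = 5, int $iterations = 1000): float
{
    $samples = [];

    for ($run = 0; $run < $runs; $run++) {
        $start = hrtime(true); // monotonic clock, nanoseconds

        for ($i = 0; $i < $iterations; $i++) {
            $fn();
        }

        $samples[] = (hrtime(true) - $start) / 1e9;
    }

    return median($samples);
}
```

Inside the command it could be used as `medianRunTime(fn () => $this->countTokensPhp($text))`, assuming `$text` and the encoder are set up as in `handle()`.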

yethee commented 11 months ago

Hi!

Thanks for the report. I can confirm that the PHP implementation is less performant than the original library.

In the tiktoken library, the core logic is written in Rust. I don't think you can get comparable performance from PHP.

```
$ phpbench run src/EncodeBench.php --report=aggregate --php-config='{"zend.assertions":-1}'
PHPBench (1.2.14) running benchmarks...
with configuration file: /var/bench/phpbench.json
with PHP version 8.1.24, xdebug ❌, opcache ❌

\Benchmark\EncodeBench

    benchPHPImplementation..................I4 - Mo22.591ms (±1.66%)
    benchRPCCounter.........................I4 - Mo4.339ms (±2.95%)

Subjects: 2, Assertions: 0, Failures: 0, Errors: 0
+-------------+------------------------+-----+------+-----+----------+----------+--------+
| benchmark   | subject                | set | revs | its | mem_peak | mode     | rstdev |
+-------------+------------------------+-----+------+-----+----------+----------+--------+
| EncodeBench | benchPHPImplementation |     | 100  | 5   | 18.606mb | 22.591ms | ±1.66% |
| EncodeBench | benchRPCCounter        |     | 100  | 5   | 18.606mb | 4.339ms  | ±2.95% |
+-------------+------------------------+-----+------+-----+----------+----------+--------+
```

We need to further investigate the issue to understand whether optimization is possible.

Benchmark code

```php
        // ... (the start of the file is missing from the source)
        $this->encoder = $provider->get('cl100k_base');
        $this->httpClient = $httpClient;
        $this->text = $httpClient
            ->request('GET', 'https://baconipsum.com/api/?type=meat-and-filler&paras=100&format=text')
            ->getContent();
    }

    #[Bench\Iterations(5)]
    #[Bench\Revs(100)]
    #[Bench\Warmup(1)]
    public function benchPHPImplementation(): void
    {
        count($this->encoder->encode($this->text));
    }

    #[Bench\Iterations(5)]
    #[Bench\Revs(100)]
    #[Bench\Warmup(1)]
    public function benchRPCCounter(): void
    {
        $this->httpClient
            ->request('POST', 'http://tiktoken-counter:8000/count', [
                'json' => [
                    'text' => $this->text,
                    'encoding' => 'cl100k_base',
                ]
            ])
            ->toArray()['tokens'];
    }
}
```

```yaml
version: "3.7"
services:
  bench:
    build:
      dockerfile: docker/Dockerfile
    depends_on:
      - tiktoken-counter
    working_dir: "/var/bench"
    volumes:
      - ".:/var/bench"
  tiktoken-counter:
    image: ghcr.io/flexchar/tiktoken-counter
    expose:
      - "8000"
```

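
For a rough cross-check, the two sets of numbers in this thread can be put on the same per-call scale. The sketch below only does arithmetic on the figures quoted above; `slowdown` is a hypothetical helper. The runs used different PHP versions (8.2.10 and 8.1.24) and different random texts, and the Docker/RPC figure includes an HTTP round trip, so the ratios are indicative only.

```php
<?php

declare(strict_types=1);

/**
 * How many times slower $slow is than $fast (same unit for both).
 */
function slowdown(float $slow, float $fast): float
{
    return $slow / $fast;
}

// Laravel command: total seconds for 1000 calls, converted to ms per call.
$calls = 1000;
$laravelPhpMs = 20.138854026794 / $calls * 1000;
$laravelDockerMs = 4.5049350261688 / $calls * 1000;

// PHPBench: mode time per call, already in ms.
$benchPhpMs = 22.591;
$benchRpcMs = 4.339;

printf(
    "Laravel:  %.2f ms vs %.2f ms per call, %.2fx slower\n",
    $laravelPhpMs,
    $laravelDockerMs,
    slowdown($laravelPhpMs, $laravelDockerMs)
);
printf(
    "PHPBench: %.2f ms vs %.2f ms per call, %.2fx slower\n",
    $benchPhpMs,
    $benchRpcMs,
    slowdown($benchPhpMs, $benchRpcMs)
);
// Laravel:  20.14 ms vs 4.50 ms per call, 4.47x slower
// PHPBench: 22.59 ms vs 4.34 ms per call, 5.21x slower
```

Both measurements land at roughly 20 ms per call for the PHP encoder and 4.5 ms for the HTTP service, a gap of about 4.5x to 5x.
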
flexchar commented 11 months ago

In that case, something like PHP FFI with a native implementation in C++ could be a fairer comparison! I see there are several implementations, for example https://github.com/sewenew/tokenizer
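
To illustrate the mechanism only, here is a minimal sketch of calling native code through PHP FFI. It uses libc's `strlen` as a stand-in, because the C interface a tokenizer library would need to expose is not shown in this thread; `nativeStrlen` is a hypothetical name. `FFI::cdef()` parses C declarations, so a C++ library would presumably need an `extern "C"` wrapper first. The sketch assumes the `ffi` extension is installed and a platform where symbols can be resolved without naming a library (Linux or macOS).

```php
<?php

declare(strict_types=1);

/**
 * Measure a string's byte length by calling C's strlen() via FFI.
 * strlen is only a placeholder for a real tokenizer entry point.
 */
function nativeStrlen(string $text): int
{
    if (!extension_loaded('ffi')) {
        throw new RuntimeException('The FFI extension is not available.');
    }

    // No library name is passed, so the symbol is looked up among the
    // libraries already loaded into the PHP process, which includes libc.
    $libc = FFI::cdef('size_t strlen(const char *s);');

    // PHP strings can be passed directly to a const char* parameter.
    return $libc->strlen($text);
}
```

With FFI available, `nativeStrlen('hello world')` returns `11`. A real integration would replace the declaration with the wrapper's function signatures and pass the path of the compiled shared library as the second argument to `FFI::cdef()`.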

yethee commented 5 months ago

I updated the implementation, which made converting text into tokens roughly 2 times faster for your example. See #10 for details.

flexchar commented 5 months ago

That is extraordinary work! I also thought it might be possible to call a C++ implementation of tiktoken through PHP FFI. While I understand the general idea, that is sadly far beyond my skill set. ✌️