yethee / tiktoken-php

This is a port of the tiktoken
MIT License

Count tokens for tools / functions calls #11

Open denistorresan opened 2 months ago

denistorresan commented 2 months ago

Hello, while looking around I found Java and Node implementations that calculate the tokens consumed by function / tool definitions. I ported the code to PHP; it might be useful to add it to the project.

I started from this thread:

https://community.openai.com/t/how-to-calculate-the-tokens-when-using-function-call/266573/10

and I took the code from these repos and ported it to plain PHP:

https://github.com/forestwanglin/openai-java/blob/72d7bfc8ffb1bfb810b99518d0f99110e3204227/jtokkit/src/main/java/xyz/felh/openai/jtokkit/utils/TikTokenUtils.java#L392

https://github.com/hmarr/openai-chat-tokens/blob/main/src/functions.ts (see also this blog post: https://hmarr.com/blog/counting-openai-tokens/)

Attached is a raw PHP implementation of TikTokenUtils:

<?php

namespace App\Core\Text;

use Yethee\Tiktoken\EncoderProvider;

/**
 * Text Utils
 * 
 * @author Denis
 */
class TikTokenUtils {

    /**
     * Count tokens
     * 
     * @param string $text
     * @param string $model
     * @return int
     */
    public static function tokens(string $text, $model = 'gpt-3.5-turbo'): int {
        $provider = new EncoderProvider();
        $encoder = $provider->getForModel($model);
        $tokens = $encoder->encode($text);

        return count($tokens);
    }

    /**
     * Count Functions / Tools tokens
     *
     * @param array $tools
     * @param string $model
     * @return int
     */
    public static function functionsTokens(array $tools, $model = 'gpt-3.5-turbo'): int {
        $tokens = 0;

        if( !empty( $tools ) ) {
            // Pass $model through so the encoder matches the requested model
            $tokens = self::tokens(self::formatFunctionDefinitions($tools), $model);
            $tokens += 9; // Additional tokens for function definition
        }

        return $tokens;
    }

    /**
     * Format Function Definitions as TypeScript
     * OpenAI appears to be turning the function definitions into TypeScript type definitions.
     * 
     * Migrated from https://github.com/forestwanglin/openai-java/blob/main/jtokkit/src/main/java/xyz/felh/openai/jtokkit/utils/FunctionFormat.java
     * 
     * Returns the tool definitions rendered as TypeScript.
     */
    public static function formatFunctionDefinitions($tools) {
        $lines = array();
        $lines[] = "namespace functions {";
        $lines[] = "";

        foreach ($tools as $tool) {
            if(!empty($tool['function']['description'])) {
                $lines[] = sprintf("// %s", $tool['function']['description']);
            }

            // Pass the whole "parameters" schema: "required" is a sibling of
            // "properties", so the formatter needs both to mark optional fields.
            $parameters = $tool['function']['parameters'] ?? array();

            if (!empty($parameters['properties'])) {
                $lines[] = sprintf("type %s = (_: {", $tool['function']['name']);
                $lines[] = self::formatObjectProperties($parameters, 0);
                $lines[] = "}) => any;";
            } else {
                $lines[] = sprintf("type %s = () => any;", $tool['function']['name']);
            }

            $lines[] = "";
        }

        $lines[] = "} // namespace functions";

        return implode("\n", $lines);
    }

    /**
     * Convert the properties of an object schema to TypeScript
     * 
     * @param array $schema Object schema holding "properties" and, optionally, "required"
     * @param int $indent
     * @return string
     */
    public static function formatObjectProperties($schema, $indent) {

        if (empty($schema["properties"])) {
            return "";
        }

        // "required" lists the names of the mandatory properties
        $requiredParams = array();
        if(!empty( $schema["required"] )) {
            $requiredParams = $schema["required"];
        }

        $lines = array();

        foreach ($schema["properties"] as $name => $property) {

            // Descriptions are only emitted for top-level properties
            if (!empty($property["description"]) && $indent < 2) {
                $lines[] = sprintf("// %s", $property["description"]);
            }

            if (in_array($name, $requiredParams)) {
                $lines[] = sprintf("%s: %s,", $name, self::formatType($property, $indent));
            }
            else{
                // Optional properties get a "?" suffix
                $lines[] = sprintf("%s?: %s,", $name, self::formatType($property, $indent));
            }

        }

        return implode("\n", array_map(function ($it) use ($indent) {
            return str_repeat(" ", max(0, $indent)) . $it;
        }, $lines));

    }

    /**
     * Format single property type to TypeScript
     *  
     * @param $property
     * @param $indent
     * @return string
     */
    public static function formatType($property, $indent) {
        // Schemas without an explicit "type" fall through to the default case
        $type = $property["type"] ?? "";

        switch ($type) {
            case "string":
                if (!empty($property["enum"])) {
                    return implode(" | ", array_map(function ($it) {
                        return sprintf("\"%s\"", $it);
                    }, $property["enum"]));
                }
                return "string";
            case "array":
                if (!empty($property["items"])) {
                    return sprintf("%s[]", self::formatType($property["items"], $indent));
                }
                return "any[]";
            case "object":
                return sprintf("{\n%s\n}", self::formatObjectProperties($property, $indent + 2));
            case "integer":
            case "number":
                if (!empty($property["enum"])) {
                    // Numeric literals are not quoted in TypeScript
                    return implode(" | ", array_map(function ($it) {
                        return sprintf("%s", $it);
                    }, $property["enum"]));
                }
                return "number";
            case "boolean":
                return "boolean";
            case "null":
                return "null";
            default:
                return "";
        }
    }
}

Example of how to use:

        $tools = [
            [
                'type' => 'function',
                'function' => [
                    'name' => 'get_flight_status',
                    'description' => 'Get the status of a flight by its flight number. The answer must always provide the coming_from, airline, flight_status, estimated_arrival_time and delayed_arrival_time when not empty.',
                    'parameters' => [
                        'type' => 'object',
                        'properties' => [
                            'flight_number' => [
                                'type' => 'string',
                                'description' => 'The Flight Number, MUST respect this pattern: 2 letters, and 5 numbers and may contain spaces; eg: BA00576',
                            ],
                            'day' => [
                                'type' => 'string',
                                'description' => 'The day of the flight',
                            ],
                        ],
                        'required' => ['flight_number'],
                    ],
                ],
            ],
            [
                'type' => 'function',
                'function' => [
                    'name' => 'get_time',
                    'description' => 'Get the current time.',
                ],
            ]
        ];

        // Get the number of tokens used by the tools
        echo TikTokenUtils::functionsTokens($tools);
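
To make it easier to see what text is actually being tokenized, here is a minimal, self-contained sketch that renders a single tool as the TypeScript-like definition. `sketchToolAsTypeScript` is a hypothetical helper written only for this illustration (it is not part of tiktoken-php), and it assumes flat parameters without enums or nested objects.

```php
<?php

/**
 * Hypothetical helper for illustration only: renders one tool as the
 * TypeScript-like text that gets tokenized. Assumes flat, non-enum
 * parameters to keep the sketch short.
 */
function sketchToolAsTypeScript(array $tool): string
{
    $fn = $tool['function'];
    $params = $fn['parameters'] ?? [];
    $required = $params['required'] ?? [];
    $lines = ['namespace functions {', ''];

    if (!empty($fn['description'])) {
        $lines[] = '// ' . $fn['description'];
    }

    if (empty($params['properties'])) {
        // No parameters: the function type takes no argument
        $lines[] = sprintf('type %s = () => any;', $fn['name']);
    } else {
        $lines[] = sprintf('type %s = (_: {', $fn['name']);
        foreach ($params['properties'] as $name => $property) {
            if (!empty($property['description'])) {
                $lines[] = '// ' . $property['description'];
            }
            // JSON Schema "integer" maps to TypeScript "number"
            $type = $property['type'] ?? 'any';
            if ($type === 'integer') {
                $type = 'number';
            }
            // Properties not listed in "required" are marked optional with "?"
            $optional = in_array($name, $required, true) ? '' : '?';
            $lines[] = sprintf('%s%s: %s,', $name, $optional, $type);
        }
        $lines[] = '}) => any;';
    }

    $lines[] = '';
    $lines[] = '} // namespace functions';

    return implode("\n", $lines);
}

echo sketchToolAsTypeScript([
    'type' => 'function',
    'function' => [
        'name' => 'get_flight_status',
        'description' => 'Get the status of a flight.',
        'parameters' => [
            'type' => 'object',
            'properties' => [
                'flight_number' => ['type' => 'string', 'description' => 'The flight number'],
                'day' => ['type' => 'string', 'description' => 'The day of the flight'],
            ],
            'required' => ['flight_number'],
        ],
    ],
]), "\n";
```

With the shortened descriptions used here, the script prints:

```
namespace functions {

// Get the status of a flight.
type get_flight_status = (_: {
// The flight number
flight_number: string,
// The day of the flight
day?: string,
}) => any;

} // namespace functions
```

The token count for the tools is then the number of tokens in this string plus the fixed overhead of 9 used in `functionsTokens()`.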