yethee / tiktoken-php

This is a port of the tiktoken
MIT License

The calculation of Chinese characters deviates greatly from the official website #3

Closed andy-sh closed 1 year ago

andy-sh commented 1 year ago

The token count for Chinese text deviates greatly from the count shown by the official online tokenizer. Example:

作为一个猎头公司,您拥有大量的履历表和候选人的反馈信息,这些信息对于企业客户来说非常有价值。

yethee commented 1 year ago

To get the same result as the online tokenizer, use the r50k_base encoding.

use Yethee\Tiktoken\EncoderProvider;

$tokens = (new EncoderProvider())->get('r50k_base')->encode('作为一个猎头公司,您拥有大量的履历表和候选人的反馈信息,这些信息对于企业客户来说非常有价值。');
count($tokens); // OUT: 98

The cl100k_base encoding is used for GPT-3.5 and GPT-4, so its token counts differ from those of r50k_base.
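
To see how much the choice of encoding matters for a given text, the two encodings mentioned in this thread can be compared side by side. The following is only a rough sketch: it assumes the library was installed with Composer (so that vendor/autoload.php exists next to the script), and it uses only the get() and encode() calls shown above. The variable names are illustrative.

```php
<?php

declare(strict_types=1);

// Assumption: the library was installed with Composer (package yethee/tiktoken),
// so the autoloader lives in ./vendor next to this script.
require __DIR__ . '/vendor/autoload.php';

use Yethee\Tiktoken\EncoderProvider;

// The sample sentence from the issue report.
$text = '作为一个猎头公司,您拥有大量的履历表和候选人的反馈信息,这些信息对于企业客户来说非常有价值。';

$provider = new EncoderProvider();

// Collect the token count for each encoding name discussed in the thread.
$counts = [];
foreach (['r50k_base', 'cl100k_base'] as $name) {
    $counts[$name] = count($provider->get($name)->encode($text));
}

// Print one line per encoding so the difference is easy to spot.
foreach ($counts as $name => $count) {
    printf("%s: %d tokens\n", $name, $count);
}
```

If the numbers printed for the two encodings differ, that difference is the likely source of the mismatch with the online tokenizer: the count only matches when the same encoding is used on both sides.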