andy-sh closed this issue 1 year ago
You should use the r50k_base encoding if you want to get the same result as in the online tokenizer.
$tokens = (new EncoderProvider())->get('r50k_base')->encode('作为一个猎头公司,您拥有大量的履历表和候选人的反馈信息,这些信息对于企业客户来说非常有价值。');
count($tokens); // OUT: 98
The cl100k_base encoding is used for GPT-3.5 and GPT-4.
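To see how much the choice of encoding matters for this text, a minimal sketch like the one below counts tokens for the same string under both encodings. It reuses the `get()` and `encode()` calls from the snippet above. The Composer autoload path and the `Yethee\Tiktoken` namespace are assumptions, so adjust them to match your installation.

```php
<?php
// Assumption: EncoderProvider lives in the Yethee\Tiktoken namespace and the
// library was installed with Composer. Adjust both lines if your setup differs.
use Yethee\Tiktoken\EncoderProvider;

require __DIR__ . '/vendor/autoload.php';

// Same sample text as in the snippet above.
$text = '作为一个猎头公司,您拥有大量的履历表和候选人的反馈信息,这些信息对于企业客户来说非常有价值。';

$provider = new EncoderProvider();

// Count tokens for the same string under each encoding.
$counts = [];
foreach (['r50k_base', 'cl100k_base'] as $encodingName) {
    $counts[$encodingName] = count($provider->get($encodingName)->encode($text));
}

// Print one line per encoding so the two counts can be compared directly.
foreach ($counts as $encodingName => $count) {
    echo $encodingName, ': ', $count, " tokens\n";
}
```

The two counts will generally differ because each encoding has its own vocabulary. When comparing against an online tokenizer, check which encoding that page uses before treating a mismatch as a bug.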
The token count for Chinese text deviates greatly from the one shown on the official website, for example: