zurawiki / tiktoken-rs

Ready-made tokenizer library for working with GPT and tiktoken

fix: expose num_tokens_from_messages and ensure its implementation is consistent with OpenAI #14

Closed · j178 closed this issue 1 year ago

j178 commented 1 year ago

The current `num_tokens_from_messages` implementation is not consistent with the prompt token counts reported by OpenAI.

`data.json`:

```json
{
  "model": "gpt-3.5-turbo",
  "max_tokens": 1,
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful, pattern-following assistant that translates corporate jargon into plain English."
    },
    {
      "role": "system",
      "name": "example_user",
      "content": "New synergies will help drive top-line growth."
    },
    {
      "role": "system",
      "name": "example_assistant",
      "content": "Things working well together will increase revenue."
    },
    {
      "role": "system",
      "name": "example_user",
      "content": "Let's circle back when we have more bandwidth to touch base on opportunities for increased leverage."
    },
    {
      "role": "system",
      "name": "example_assistant",
      "content": "Let's talk later when we're less busy about how to do better."
    },
    {
      "role": "user",
      "content": "This late pivot means we don't have time to boil the ocean for the client deliverable."
    }
  ]
}
```

```sh
curl -s -X POST \
  -H "Content-Type: application/json" \
  -d "@data.json" \
  https://api.openai.com/v1/chat/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  | jq '.usage.prompt_tokens'
```

The API reports `prompt_tokens` = 127, but the current `num_tokens_from_messages` gives 122 for the same messages.
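
For comparison, OpenAI's cookbook describes the counting rule the API applies to `gpt-3.5-turbo-0301`-style chat payloads: a fixed per-message overhead, the encoded role, name, and content, a correction when a `name` is present, and a priming constant for the assistant reply. Below is a rough Rust sketch of that rule on top of tiktoken-rs. The `Message` struct and `count_prompt_tokens` helper are made up for illustration, the constants (4 per message, -1 per name, +2 for reply priming) are the cookbook values for `gpt-3.5-turbo-0301` rather than anything defined by this crate, and it assumes `CoreBPE` exposes `encode_ordinary` as in upstream tiktoken.

```rust
// Sketch of OpenAI's cookbook-style prompt token counting for gpt-3.5-turbo
// (cl100k_base). Constants and message format are assumptions taken from the
// cookbook for gpt-3.5-turbo-0301, not from tiktoken-rs itself.
use tiktoken_rs::cl100k_base;

// Hypothetical message shape for illustration only.
struct Message<'a> {
    role: &'a str,
    name: Option<&'a str>,
    content: &'a str,
}

fn count_prompt_tokens(messages: &[Message]) -> usize {
    let bpe = cl100k_base().expect("cl100k_base should load");
    let mut num_tokens = 0;
    for m in messages {
        // every message follows <|start|>{role/name}\n{content}<|end|>\n
        num_tokens += 4;
        num_tokens += bpe.encode_ordinary(m.role).len();
        num_tokens += bpe.encode_ordinary(m.content).len();
        if let Some(name) = m.name {
            // if a name is present, the role is omitted
            num_tokens += bpe.encode_ordinary(name).len();
            num_tokens -= 1;
        }
    }
    // every reply is primed with <|start|>assistant
    num_tokens += 2;
    num_tokens
}
```

An off-by-a-few discrepancy like 122 vs. 127 usually comes from these per-message constants or from how the optional `name` field is handled, so matching the cookbook logic exactly is where I would look first.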