zurawiki / tiktoken-rs

Ready-made tokenizer library for working with GPT and tiktoken
MIT License
240 stars 46 forks

Improved interface for split_by_token #18

Closed: jackbackes closed this 1 year ago

jackbackes commented 1 year ago

The split_by_token_ordinary method and its iterator counterpart, split_by_token_ordinary_iter, have been added to the CoreBPE struct in vendor_tiktoken.rs. Both tokenize a string in ordinary mode, that is, without applying the BPE model's special tokens.

Renamed split_by_token_with_special_tokens to simply split_by_token, and distinguished methods that return an iterator (the _iter suffix) from those that return a collection.
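
For readers unfamiliar with the iterator-vs-collection naming split, here is a minimal sketch of the pattern. It is not the code from this PR: ToyTokenizer is a hypothetical stand-in for CoreBPE, the signatures are assumptions, and it splits on whitespace rather than BPE token boundaries so that it compiles without any dependencies.

```rust
/// Hypothetical stand-in for a tokenizer. This is NOT the real CoreBPE
/// from tiktoken-rs; it only illustrates the naming convention.
struct ToyTokenizer;

impl ToyTokenizer {
    /// `_iter` variant: lazily yields each piece of `text`.
    /// Assumption: pieces are whitespace-separated words, standing in
    /// for the BPE token boundaries the real crate would use.
    fn split_by_token_iter<'a>(
        &'a self,
        text: &'a str,
    ) -> impl Iterator<Item = String> + 'a {
        text.split_whitespace().map(|piece| piece.to_string())
    }

    /// Collection variant: eagerly gathers the same pieces into a Vec
    /// by delegating to the `_iter` method, so the two cannot drift apart.
    fn split_by_token(&self, text: &str) -> Vec<String> {
        self.split_by_token_iter(text).collect()
    }
}
```

With this shape, a caller who needs only the first few pieces can chain .take(n) on the _iter method and stop early, while a caller who wants everything uses the collecting method.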

jackbackes commented 1 year ago

I thought about this some more - I think this interface is more in line with the rest of the codebase.

zurawiki commented 1 year ago

nice!