zurawiki / tiktoken-rs

Ready-made tokenizer library for working with GPT and tiktoken
MIT License

Incomplete utf-8 byte sequence from index 0 #23

Closed: BohuTANG closed this issue 1 year ago

BohuTANG commented 1 year ago

The code is:

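use tiktoken_rs::r50k_base;

#[test]
fn test_token() {
    let input = "🍌This is a sentence   with spaces, hahhahah haha ha";
    // `?` is not allowed in a test that returns `()`, so unwrap instead.
    let rke = r50k_base().unwrap();
    // This unwrap panics with the error shown below.
    let _ = rke.split_by_token(input, true).unwrap();
}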

Error:

called `Result::unwrap()` on an `Err` value: incomplete utf-8 byte sequence from index 0

Caused by this line: https://github.com/zurawiki/tiktoken-rs/blob/1aa3d90f220ec4dc42a2ff489a42b75b4e3b6cf8/tiktoken-rs/src/vendor_tiktoken.rs#L618

Should it use String::from_utf8_lossy instead?

zurawiki commented 1 year ago

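Thanks for the clear repro. Unfortunately, not every token sequence can be split into chunks that are each valid UTF-8: a multi-byte character such as 🍌 can span more than one token, so decoding those tokens one at a time fails. I added test cases in #24 to ensure splitting and round_trip work as expected.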
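For readers following along, here is a minimal standard-library sketch of the trade-off discussed in this thread. It does not call tiktoken-rs. The helper names `decode_strict` and `decode_lossy` are hypothetical, and the split point inside the emoji is an assumption made for illustration; the actual r50k_base token boundaries may differ.

```rust
use std::string::FromUtf8Error;

/// Decode every chunk on its own with strict UTF-8 validation.
/// Returns the first error if any chunk is not valid UTF-8 by itself.
/// (Hypothetical helper, not part of tiktoken-rs.)
fn decode_strict(chunks: &[&[u8]]) -> Result<Vec<String>, FromUtf8Error> {
    chunks.iter().map(|c| String::from_utf8(c.to_vec())).collect()
}

/// Decode every chunk on its own, replacing invalid bytes with U+FFFD.
/// (Hypothetical helper, not part of tiktoken-rs.)
fn decode_lossy(chunks: &[&[u8]]) -> Vec<String> {
    chunks
        .iter()
        .map(|c| String::from_utf8_lossy(c).into_owned())
        .collect()
}

fn main() {
    // The banana emoji is four bytes in UTF-8: F0 9F 8D 8C.
    let banana = "🍌".as_bytes();

    // Assumed split for illustration: two chunks of two bytes each.
    let chunks: [&[u8]; 2] = [&banana[..2], &banana[2..]];

    // Strict per-chunk decoding fails because neither chunk is complete.
    match decode_strict(&chunks) {
        Ok(parts) => println!("strict: {:?}", parts),
        Err(e) => println!("strict failed: {}", e),
    }

    // Lossy per-chunk decoding never fails, but the emoji is lost.
    println!("lossy: {:?}", decode_lossy(&chunks));

    // Joining the bytes first and decoding once recovers the original text.
    let joined = String::from_utf8(chunks.concat()).unwrap();
    println!("joined: {}", joined);
}
```

The sketch suggests why String::from_utf8_lossy is not a drop-in fix: it removes the error, but the per-chunk strings no longer concatenate back to the input. Only decoding the joined bytes round-trips.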