#[test]
fn test_token() {
    let input = "🍌This is a sentence with spaces, hahhahah haha ha";
    // `?` cannot be used here because the test function returns `()`.
    let rke = r50k_base().unwrap();
    let _ = rke.split_by_token(input, true).unwrap();
}
Error:
called `Result::unwrap()` on an `Err` value: incomplete utf-8 byte sequence from index 0
Thanks for the clear repro. Unfortunately, not every string can be split into chunks that are each valid UTF-8 on their own: a token boundary can fall in the middle of a multi-byte character such as 🍌. I added test cases to ensure splitting and round_trip work as expected in #24.
Caused by this line: https://github.com/zurawiki/tiktoken-rs/blob/1aa3d90f220ec4dc42a2ff489a42b75b4e3b6cf8/tiktoken-rs/src/vendor_tiktoken.rs#L618
Should it be `String::from_utf8_lossy`?
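To illustrate the difference between the two decoding calls, here is a minimal sketch using only the Rust standard library. The helper names `decode_strict` and `decode_lossy` are hypothetical and are not part of tiktoken-rs. The byte split used in the usage note below is chosen for illustration and is not necessarily where `r50k_base` places its token boundaries.

```rust
use std::string::FromUtf8Error;

/// Strict decoding (hypothetical helper): fails if `bytes` is not a
/// complete, valid UTF-8 sequence.
fn decode_strict(bytes: &[u8]) -> Result<String, FromUtf8Error> {
    String::from_utf8(bytes.to_vec())
}

/// Lossy decoding (hypothetical helper): never fails; invalid or
/// incomplete sequences are replaced with U+FFFD.
fn decode_lossy(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}
```

"🍌" is encoded as the four bytes `F0 9F 8D 8C`. If a chunk ends after the first two, `decode_strict(&[0xF0, 0x9F])` returns an `Err`, while `decode_lossy(&[0xF0, 0x9F])` returns a string containing the replacement character. The trade-off is that lossy chunks no longer round-trip: concatenating them does not reproduce the original input.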