Unicode support - Githubissues

benman1 commented 4 years ago

Please implement unicode string support.

In C++, std::wstring is a wrapper for wchar_t, just as std::string is a wrapper for char. wchar_t is defined in C as well [1]. A similar API is Glib::ustring, from glibmm (the C++ bindings for GLib).

The major difference from std::string is that each character is a wchar_t, which is 4 bytes on most Unix-like platforms (2 bytes on Windows), rather than 1 byte.

jacereda commented 3 years ago

I would stick to UTF-8 encoded strings and just implement a glyphs() function for getting an array of Unicode code points.

aep commented 3 years ago

that sounds like a good idea to me, assuming by array you mean an iterator. is there a portable C library for doing that?
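To make the iterator idea concrete, here is a minimal C sketch of a code point iterator over a UTF-8 byte range. The names utf8_iter, utf8_iter_make, utf8_iter_next and utf8_count are hypothetical and are not taken from zz or from any of the libraries linked in this thread. It is a plain branch-based decoder, not the DFA approach, and it assumes that rejecting malformed input (overlong forms, surrogates, values above U+10FFFF, truncated sequences) is the desired behaviour.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical iterator type, for illustration only. */
typedef struct {
    const uint8_t *p;   /* next byte to decode */
    const uint8_t *end; /* one past the last byte */
    uint32_t val;       /* most recently decoded code point */
    bool error;         /* set once malformed input is seen */
} utf8_iter;

static utf8_iter utf8_iter_make(const char *s, size_t len) {
    assert(s != NULL);
    utf8_iter it = { (const uint8_t *)s, (const uint8_t *)s + len, 0, false };
    return it;
}

static bool utf8_iter_fail(utf8_iter *it) {
    it->error = true;
    return false;
}

/* Decode the next code point into it->val.
 * Returns false at the end of input or on malformed input (it->error set). */
static bool utf8_iter_next(utf8_iter *it) {
    assert(it != NULL);
    if (it->error || it->p >= it->end) return false;

    uint8_t b0 = it->p[0];
    uint32_t cp, min; /* min: smallest value allowed for this length */
    size_t n;         /* number of continuation bytes expected */
    if (b0 < 0x80)                { cp = b0;        n = 0; min = 0; }
    else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; n = 1; min = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; n = 2; min = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; n = 3; min = 0x10000; }
    else return utf8_iter_fail(it); /* stray continuation or invalid lead byte */

    if ((size_t)(it->end - it->p) <= n) return utf8_iter_fail(it); /* truncated */

    for (size_t i = 1; i <= n; i++) {
        uint8_t b = it->p[i];
        if ((b & 0xC0) != 0x80) return utf8_iter_fail(it);
        cp = (cp << 6) | (uint32_t)(b & 0x3F);
    }

    /* Reject overlong encodings, UTF-16 surrogates and out-of-range values. */
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return utf8_iter_fail(it);

    it->val = cp;
    it->p += n + 1;
    return true;
}

/* Count code points; returns false if the input is malformed. */
static bool utf8_count(const char *s, size_t len, size_t *out) {
    assert(out != NULL);
    utf8_iter it = utf8_iter_make(s, len);
    size_t n = 0;
    while (utf8_iter_next(&it)) n++;
    *out = n;
    return !it.error;
}
```

With this sketch, the 12 bytes of "你好世界" decode to 4 code points, so an iterator like this never needs to allocate an array.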

aep commented 3 years ago

is this sane? https://github.com/adricoin2010/UTF8-Iterator

looks suspiciously simple

jacereda commented 3 years ago

Not so simple, I prefer this one:

https://bjoern.hoehrmann.de/utf-8/decoder/dfa/

Scroll down to the bottom, there's a better implementation than the one on top.

aep commented 3 years ago

ok i'm implementing this now, but i need everyone to comment on the api and its features.

here's my first draft.

string::String is similar to slice::Slice, except there's no MutSlice and the iterators are on utf8 codepoints rather than bytes
string::buffer::StringBuffer is similar to buffer::Buffer and autocasts to String

pub fn main() {

   /// construct a new string reference from a borrowed null terminated char*
   let s = string::from_cstr("你好世界");

   /// len() counts codepoints or scalar values?
   err::assert(s.len() == 4);

   /// iterator over codepoints
   for let mut it = s.iter(); it.next(); {
       /// each decoded codepoint is stored as a u32 (4 bytes)
       u32 ch = it.val;
   }

   /// note the lack of char indexing.
   /// this is not possible:
   /// u32 ch = s[2];

   // but you can convert it to a slice for byte indexing
   let sl = s.as_utf8_slice();
   u8 meh = sl.mem[2];

   // or copy to a vec
   new[item = u32, +100] v = s.to_vec();
   u32 bleh = v.items[2];

   /// return string as null terminated utf8 char*
   printf("%s", s.cstr());

   /// concat two strings using a string buffer
   new[+1000] b = string::buffer::make();
   b.append(string::from_cstr("hello world"));
   b.append(string::from_cstr("  "));
   b.append(string::from_cstr("你好世界"));

   /// borrow a buffer as str
   let x = b.as_str();   

   /// split
   usize mut iterator = 0;
   let s1 = x.split(" ", &iterator);
   let s2 = x.split(" ", &iterator);

   /// compare
   err::assert(!s1.eq(s2));

   /// substring comparison
   err::assert(s2.starts_with(string::from_cstr("你")));

}

we MIGHT also completely replace char* with string::String some day, removing the explicit calls to from_cstr, but not until we're sure string is ready
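For what split, eq and starts_with from the draft could boil down to in the emitted C, here is a rough sketch. The str_view type and the str_* functions are hypothetical names, and the split semantics (one token per call, with a caller-owned offset) are an assumption modelled on the x.split(" ", &iterator) call above, not the actual zz implementation. It relies on one known property of UTF-8: lead bytes and continuation bytes never overlap, so a byte-wise match of a valid UTF-8 needle inside valid UTF-8 text always starts and ends on a code point boundary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical borrowed view of UTF-8 bytes (pointer plus length). */
typedef struct {
    const char *mem;
    size_t size;
} str_view;

static str_view str_from_cstr(const char *s) {
    assert(s != NULL);
    return (str_view){ s, strlen(s) };
}

static bool str_eq(str_view a, str_view b) {
    return a.size == b.size && (a.size == 0 || memcmp(a.mem, b.mem, a.size) == 0);
}

static bool str_starts_with(str_view s, str_view prefix) {
    return s.size >= prefix.size &&
           (prefix.size == 0 || memcmp(s.mem, prefix.mem, prefix.size) == 0);
}

/* Return the next token of s separated by delim, starting at byte offset *pos.
 * *pos is advanced past the delimiter. Once the input is exhausted, a view
 * with mem == NULL is returned. */
static str_view str_split(str_view s, str_view delim, size_t *pos) {
    assert(pos != NULL && delim.size > 0);
    str_view out = { NULL, 0 };
    if (*pos > s.size) return out; /* already exhausted */

    size_t start = *pos;
    for (size_t i = start; i + delim.size <= s.size; i++) {
        if (memcmp(s.mem + i, delim.mem, delim.size) == 0) {
            out.mem = s.mem + start;
            out.size = i - start;
            *pos = i + delim.size;
            return out;
        }
    }

    /* No further delimiter: the rest of the string is the last token. */
    out.mem = s.mem + start;
    out.size = s.size - start;
    *pos = s.size + 1; /* mark as exhausted */
    return out;
}
```

Under these assumptions, no decoding is needed for splitting or prefix checks; only operations that count or index code points have to walk the bytes.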

jwerle commented 3 years ago

Give me a few

jwerle commented 3 years ago

This API looks pretty straightforward and absolutely needed. I am happy that we took the approach to rewrite the string module with utf8 in mind!

My only (unrelated) question is:

// or copy to a vec
new[item = u32, +100] v = s.to_vec();

Can we do this now? (new constructor from an "instance" method)

aep commented 3 years ago

oh right, i actually forgot that's broken, thanks for the reminder

opened https://github.com/zetzit/zz/issues/123

jacereda commented 3 years ago

Why is string needed? Wouldn't a uiter() for iterating over unicode code points on a slice suffice?

aep commented 3 years ago

technically yes, but string manipulation behaves differently on unicode vs bytes. having to prefix all functions with unicode_ (unicode_split etc.) seems awkward, and the type is effectively free as it's just emitted to C as a fat pointer

also slice holds any arbitrary binary data, string holds null terminated utf8. this distinction is useful in api contracts and automatic mapping to other type systems

aep commented 3 years ago

actually i wonder if we can use attached type aliases to implement it as specialized slice.

type String = slice::Slice[nullterm(self.mem), utf8(self.mem)];

edit: never mind, still would have to prefix utf8 specific functions, which is weird. but String can just inherit from slice by first-member rule, so you can use it as if it was a slice.
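As an illustration of the first-member rule on the C side: a pointer to a struct, suitably converted, points to its first member, so a string type that embeds the slice as its first member can be handed to slice functions. The names slice_t, string_t and the helper functions below are hypothetical, and the layout is an assumption, not what zz actually emits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical fat pointer for arbitrary bytes. */
typedef struct {
    const uint8_t *mem;
    size_t size;
} slice_t;

/* Hypothetical string type: same data, plus the contract (not enforced in
 * this sketch) that mem holds null terminated UTF-8. */
typedef struct {
    slice_t slice; /* must stay the first member */
} string_t;

/* A function written against the plain slice type. */
static size_t slice_len(const slice_t *s) {
    assert(s != NULL);
    return s->size;
}

static string_t string_from_cstr(const char *s) {
    assert(s != NULL);
    string_t out = { { (const uint8_t *)s, strlen(s) } };
    return out;
}

/* Valid in C because a pointer to a struct, suitably converted,
 * points to its initial member. */
static const slice_t *string_as_slice(const string_t *s) {
    return (const slice_t *)s;
}
```

A call such as slice_len(string_as_slice(&str)) then works without copying, while functions that require valid UTF-8 can keep taking string_t in their signatures.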

sternenseemann commented 3 years ago

I'd recommend using Julia's utf8proc, which is reasonably lightweight, supports UTF-8 decoding and encoding (from and to code points), and has other features that are definitely needed for proper Unicode handling, like normalization and grapheme clustering.

zetzit / zz

Unicode support #44