Improved handling of strings and unicode

ziglang / zig

General-purpose programming language and toolchain for maintaining robust, optimal, and reusable software.

https://ziglang.org

MIT License


Improved handling of strings and unicode #234

Closed by ddevault 7 years ago

ddevault commented 7 years ago

Treating []u8 as strings is incorrect. []u8 is an array of octets, not an array of characters. Zig should support Unicode more explicitly and enforce the distinction between []u8 and str in the language and standard library.

I propose adding a rune type, which holds one Unicode codepoint. The underlying storage mechanism isn't relevant to the programmer, who can only assume it's an int capable of holding a Unicode codepoint. On platforms whose pointers are sufficiently sized, it should probably be a usize under the covers. I also propose adding a str type, which is opaque but offers length and indexing of runes. The underlying string encoding is also not important to the programmer, but some possible strategies include always using UTF-8 or UTF-32, or upgrading the encoding as necessary to fit the runes the user attempts to place in it.

Also provided should be standard library functions for manipulating strings separately from []u8, and helpful functions to convert str to []u8 and back again in arbitrary encodings.
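To make the octets-versus-characters distinction concrete, here is a minimal sketch in Python, chosen only because it already separates `bytes` from `str`. It is an illustration, not a proposed Zig API; the sample string and the `describe` helper are assumptions made up for this example.

```python
# Illustration only: Python's bytes/str split stands in for the proposed
# []u8/str split. None of these names are Zig APIs.

# A str is a sequence of Unicode code points (written with escapes so the
# exact code points are unambiguous): "h", U+00E9, "llo, ", U+4E16, U+754C.
text = "h\u00e9llo, \u4e16\u754c"

# A bytes object is a sequence of octets in one specific encoding.
octets = text.encode("utf-8")


def describe(s):
    """Return the code point count and the encoded size in three encodings."""
    return {
        "code_points": len(s),
        "utf8_octets": len(s.encode("utf-8")),
        "utf16_octets": len(s.encode("utf-16-le")),
        "utf32_octets": len(s.encode("utf-32-le")),
    }


info = describe(text)

# Indexing a str yields a code point; indexing bytes yields an integer octet.
second_rune = text[1]      # U+00E9
second_octet = octets[1]   # 0xC3, first octet of the UTF-8 encoding of U+00E9

# Converting back and forth is explicit and names the encoding.
round_trip = octets.decode("utf-8") == text
```

The same nine code points occupy 14, 18, or 36 octets depending on the encoding, which is why a length in octets says little about a length in characters.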

c-cube commented 3 years ago

Providing string indexing that returns a grapheme cluster seems quite bad to me: it hides a very complex operation under syntax that is generally O(1). Rust does it better in this case, imo, by not providing string indexing at all.
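As a rough illustration of the hidden cost, the sketch below (Python, with a hypothetical helper `codepoint_at` that assumes valid UTF-8 input) finds the n-th code point of a UTF-8 buffer. Even this simpler operation has to scan from the start; grapheme clusters would additionally need Unicode segmentation data on top of it.

```python
def codepoint_at(data, index):
    """Return the index-th code point of UTF-8 `data` by scanning from the start.

    Hypothetical helper for illustration; assumes `data` is valid UTF-8.
    The loop visits every byte up to the requested code point, so the cost
    grows with `index` rather than being constant.
    """
    if index < 0:
        raise IndexError("negative index")
    seen = -1      # index of the most recently started code point
    start = None   # byte offset where the requested code point begins
    for pos, byte in enumerate(data):
        # Bytes of the form 0b10xxxxxx are continuation bytes; any other
        # byte starts a new code point.
        if byte & 0xC0 != 0x80:
            seen += 1
            if seen == index:
                start = pos
            elif seen == index + 1:
                # The next code point has started, so the requested one ended.
                return data[start:pos].decode("utf-8")
    if start is None:
        raise IndexError("code point index out of range")
    # The requested code point was the last one in the buffer.
    return data[start:].decode("utf-8")


sample = "a\u4e16b".encode("utf-8")   # 1 + 3 + 1 = 5 bytes, 3 code points
middle = codepoint_at(sample, 1)      # the 3-byte code point U+4E16
```

Hiding a loop like this behind `s[i]` is the concern: the syntax suggests constant time, while the work is linear in the position.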

andrewrk commented 3 years ago

@jecolon thank you for your comments. Before tagging 1.0, I will be personally auditing std.unicode (and the rest of std) while inspecting ziglyph carefully for inspiration. If you're available during that release cycle I would love to get you involved and work with you on achieving a reasonable std lib API.

In fact, if you wanted to make some sweeping, breaking changes to std.unicode right now, upstream, I would be amenable to that. The only limitation is that we won't have access to the Unicode data for the std lib. If you want to make a case that we should add that as a dependency of zig std lib, I'm willing to hear that out, but for status quo, that is a limitation because of not wanting to take on that dependency.

jecolon commented 3 years ago

> @jecolon thank you for your comments. Before tagging 1.0, I will be personally auditing std.unicode (and the rest of std) while inspecting ziglyph carefully for inspiration. If you're available during that release cycle I would love to get you involved and work with you on achieving a reasonable std lib API.

You can count on that 100%! 💯

> In fact, if you wanted to make some sweeping, breaking changes to std.unicode right now, upstream, I would be amenable to that. The only limitation is that we won't have access to the Unicode data for the std lib. If you want to make a case that we should add that as a dependency of zig std lib, I'm willing to hear that out, but for status quo, that is a limitation because of not wanting to take on that dependency.

I'll be analyzing the options for this to see if I can come up with a good proposal. The Unicode data dependency issue is at the heart of this, definitely.

@andrewrk: Thanks for this opportunity to help! :^)

doffltmiw commented 2 years ago

I've stumbled upon these slides, "D at 20: Hits and Misses", by Walter Bright.

Unicode all the way

Code pages, EBCDIC, Shift-JIS, etc., should all be processed as ubyte arrays, not char arrays

Miss: Agnostically Supporting UTF-16 and UCS-2

Turns out they're sideshows.
UTF-8 is the one.

Strings are Arrays

● No special string type!

https://dlang.org/articles/d-array-article.html

Miss: Then We Botched It

● autodecoding the strings
– sometimes it decodes code units into code points
– sometimes it does not

● still trying to dig our way out of that
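The code unit versus code point confusion behind that last point can be sketched as follows. This is a Python illustration, not D code; the helper `utf16_code_units` is hypothetical, and the sample character is chosen only because it lies outside the Basic Multilingual Plane.

```python
import struct


def utf16_code_units(s):
    """Return the UTF-16 code units of `s` as a list of 16-bit integers."""
    raw = s.encode("utf-16-le")
    # Each UTF-16 code unit is one little-endian unsigned 16-bit value.
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))


s = "\U0001F4AF"                        # a single code point, U+1F4AF

utf8_units = list(s.encode("utf-8"))    # four 8-bit code units
utf16_units = utf16_code_units(s)       # two 16-bit code units (a surrogate pair)
code_points = [ord(c) for c in s]       # one code point
```

One code point, two UTF-16 code units, four UTF-8 code units: an API that sometimes iterates one view and sometimes another reports lengths and elements that disagree for the same string.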