ziglang / zig

General-purpose programming language and toolchain for maintaining robust, optimal, and reusable software.
https://ziglang.org
MIT License

Improved handling of strings and unicode #234

Closed · ddevault closed this issue 7 years ago

ddevault commented 7 years ago

Treating []u8 as strings is incorrect. []u8 is an array of octets, not an array of characters. Zig should support Unicode more explicitly and enforce the distinction between []u8 and str in the language and standard library.

I propose adding a rune type, which holds one unicode codepoint. The underlying storage mechanism isn't relevant to the programmer, who can only assume it's an int capable of holding a unicode codepoint. On platforms whose pointers are sufficiently sized, it should probably be a usize under the covers. I also propose adding a str type, which is opaque but offers length and indexing of runes. The underlying string encoding is also not important to the programmer, but some possible strategies include always using UTF-8 or UTF-32, or upgrading the encoding as necessary to fit the runes the user attempts to place in it.

Also provided should be standard library functions for manipulating strings separately from []u8, and helpful functions to convert str to []u8 and back again in arbitrary encodings.
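
For illustration only, here is a rough userland sketch of that shape against today's std.unicode; Rune, Str, and their methods are invented names and a UTF-8 backing is assumed, not an existing API:

const std = @import("std");

// Hypothetical sketch only: neither Rune nor Str exists in Zig's std.
const Rune = u21; // any Unicode code point fits in 21 bits

const Str = struct {
    bytes: []const u8, // assumed UTF-8 backing storage

    /// Count code points (runes) rather than bytes.
    fn runeCount(self: Str) !usize {
        var it = (try std.unicode.Utf8View.init(self.bytes)).iterator();
        var n: usize = 0;
        while (it.nextCodepoint()) |_| n += 1;
        return n;
    }

    /// Return the index-th code point; O(n) because UTF-8 is variable-length.
    fn runeAt(self: Str, index: usize) !Rune {
        var it = (try std.unicode.Utf8View.init(self.bytes)).iterator();
        var i: usize = 0;
        while (it.nextCodepoint()) |cp| : (i += 1) {
            if (i == index) return cp;
        }
        return error.IndexOutOfBounds;
    }
};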

andrewrk commented 7 years ago

It seems to me that what this issue is calling for is standard library code to handle strings.

Can you explain what the use case is for the rune and str types you proposed? Under what circumstances would you use them?

For example, it seems to me that the hello world application should use []u8 and not str for the command line arguments to main, as well as the bytes being printed to stdout, because that's what is happening - command line args are arrays of octets, and what goes to stdout is arrays of octets. Text editors typically encode characters as UTF-8, and zig allows UTF-8 (any array of octets, really) in string literals. So if, for example, the hello world application used these types at all, it would introduce an unnecessary runtime conversion between str encoded data and []u8.

Point being, when actually do we want to use these types being proposed? I can certainly think of some use cases, such as implementing a text box in a GUI application. But at that point, why does it need to be part of the language? Would not a standard library module suffice?

ddevault commented 7 years ago

Can you explain what the use case is for the rune and str types you proposed? Under what circumstances would you use them?

During any string manipulation. Iterating over a str with for would give you a bunch of runes. This is more useful than iterating over a bunch of u8s, which are not characters but could be partial runes depending on the encoding of the []u8.

zig allows UTF-8 (any array of octets, really) in string literals [...] it would introduce an unnecessary runtime conversion between str encoded data and []u8.

It is necessary. Not decoding the strings will produce buggy code. The compiler can optimize the conversion away by compiling string literals into str structs instead of []u8, which it should do.

hello world application should use []u8 and not str for the command line arguments to main, as well as the bytes being printed to stdout

Args could be []u8, sure, because that's what they are. You would have to decode them before using them as strings. But there's no reason you couldn't just write []u8 to stdout if you prefer. Could also have a special syntax à la Python for string literals that are encoded as UTF-8 and become []u8 rather than str. Would also be nice to detect the signature of main and do sane argument decoding in the crt0 if the user requests args as []str.

Point being, when actually do we want to use these types being proposed? I can certainly think of some use cases, such as implementing a text box in a GUI application. But at that point, why does it need to be part of the language? Would not a standard library module suffice?

String handling is intrinsic to any programming language. Well over half of programs will do string manipulation, I expect. []u8 are not strings, and if you try to do string manipulation with them, your code will be broken. I strongly encourage you to add a language-level distinction between []u8 and str to prevent users from running into bugs. This discussion happened for Python 3 and they made the correct choice, by the way.

andrewrk commented 7 years ago

It is necessary. Not decoding the strings will produce buggy code.

const io = @import("std").io;

pub fn main(args: [][]u8) -> %void {
    %%io.stdout.printf("Hello, 世界\n");
}
$ ./test
Hello, 世界

Where's the bug?

What does an additional type in this example accomplish?

It sounds like you're saying, this is not an example where it is necessary, but users will have other use cases where they want to be doing string manipulation rather than array of octet manipulation, such as, say, taking stdin, uppercasing it, and printing it to stdout. Would that be a fair use case?

ddevault commented 7 years ago

The bug doesn't present itself in the simple case. Here are a number of examples that would break or be unsupported:

Your example only works because you're really just copying []u8 around. You're not actually doing string operations.

ddevault commented 7 years ago

Why do you think any major language designed in the past 10 years, no matter how low level, has had proper unicode string support?

thejoshwolfe commented 7 years ago

So proper unicode string support is doable in userland without any modification to the Zig language or runtime. String literals can be converted to anything at compile time with userland functions, and all those string manipulation operations mentioned above can be done in userland functions as well, even also at compile time.

Am I right in saying that this is a proposal for a standard library module, and not a proposal for a language change?

ddevault commented 7 years ago

Not as far as I can tell. I presume that making some_str[1] do the right thing involves language changes, and making string literals emit the string type instead of []u8 is also a language change. I'm not sure if zig already supports compile-time reflection to determine the parameters of main - to support the crt0 change I proposed that may require language changes.

thejoshwolfe commented 7 years ago

I presume that making some_str[1] do the right thing involves language changes

With a userland string solution, you would probably not be able to do some_str[1], unless some_str was a slice. If your string solution is a struct that contains a slice, then perhaps some_str.chars[1] would work without language changes.

But if your string solution is utf8-encoded or a rope data structure or something else that's not simply a slice of characters, then character access wouldn't be as simple as the [] makes it look. Zig does not have operator overloading, and that's to avoid hidden runtime costs. If character access requires an O(log n) tree traversal, then make that operation a function and call it like a function.

String literals can be converted to any encoding you want at compile time like this:

const motd = decodeUtf8("こんにちは");
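
As a hedged sketch of what such a comptime helper could look like with today's std.unicode (decodeUtf8 is the hypothetical name from the line above, and decoding to []const u21 is an arbitrary choice):

const std = @import("std");

// Sketch: decode a UTF-8 string literal into code points at compile time.
// decodeUtf8 is the hypothetical helper named above, not a std function.
fn decodeUtf8(comptime utf8: []const u8) []const u21 {
    comptime var out: []const u21 = &.{};
    comptime {
        // initComptime validates the literal; invalid UTF-8 is a compile error.
        var it = std.unicode.Utf8View.initComptime(utf8).iterator();
        while (it.nextCodepoint()) |cp| {
            out = out ++ &[_]u21{cp};
        }
    }
    return out;
}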

I'm not sure if zig already supports compile-time reflection to determine the parameters of main - to support the crt0 change I proposed that may require language changes.

Is there a reason other than convenience why you wouldn't want to do the conversion at the top of your main implementation by calling a userland function? Is there a reason to do it in the crt0 instead?

ddevault commented 7 years ago

But if your string solution is utf8-encoded or a rope data structure or something else that's not simply a slice of characters, then character access wouldn't be as simple as the [] makes it look. Zig does not have operator overloading, and that's to avoid hidden runtime costs. If character access requires a O(log(n)) tree traversal, then make that operation a function and call it like a function.

I'm not suggesting operator overloading - I'm just suggesting strings behave this way, which is why a language change is required.

String literals can be converted to any encoding you want at compile time like this:

What does that even do? The behavior of your example is not predictable by people who understand string encodings without reading the docs, and probably the code, because when you write the docs you will likely fail to understand what's confusing about it.

Is there a reason other than convenience why you wouldn't want to do the conversion at the top of your main implementation by calling a userland function? Is there a reason to do it in the crt0 instead?

No, just convenience.

thejoshwolfe commented 7 years ago

There's an elephant in the room in this discussion, which is that Zig wants to take memory allocation seriously. How exactly to handle memory allocation is a big discussion, and one that should probably happen in a different issue. Some high level points that are relevant here are:

There's a lot there to discuss, and again, that should probably be in another issue. The point I'm trying to make in this issue is that we can't supply functions like string splitting without thinking about memory allocation. Languages that don't ask you to think about memory allocation are, from Zig's perspective, sub-optimal languages. Python 3, JavaScript, and Java all have garbage collection, which makes string manipulation look very nice at a high level, but fails Zig's goal of optimality.

So far, Zig provides a List(T) class that does memory management by asking explicitly for an allocator when you construct the list. If we run with this idea for now, then you could make a string builder class that can decode from utf8, encode to a utf8 output buffer, stores unicode data opaquely, and even offers functions like splitting and random character access. Does it make sense for one of these string builders to exist at compile time? Maybe, but it would be complicated, since allocators probably need to work differently at runtime vs compile time. Does Zig want to supply such a string builder class in its standard library? Maybe.
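
For a concrete flavor of the explicit-allocator style, a small illustrative routine where the caller supplies the allocator and owns the result; the encodeAlloc name is invented:

const std = @import("std");

// Sketch: encode a slice of code points into a newly allocated UTF-8 buffer.
// The caller owns the returned memory and frees it with the same allocator.
fn encodeAlloc(allocator: std.mem.Allocator, codepoints: []const u21) ![]u8 {
    var total: usize = 0;
    for (codepoints) |cp| total += try std.unicode.utf8CodepointSequenceLength(cp);

    const out = try allocator.alloc(u8, total);
    errdefer allocator.free(out);

    var i: usize = 0;
    for (codepoints) |cp| i += try std.unicode.utf8Encode(cp, out[i..]);
    return out;
}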

@SirCmpwn Can you give examples of low level languages with proper unicode string support? I'd like to check out how they do memory management.

ddevault commented 7 years ago

Rust and Go come to mind as being fairly low level and having sane Unicode support. Also, you can do a lot of things with this design without bringing allocation into the discussion.

thejoshwolfe commented 7 years ago

After a few minutes of research, it looks like Rust supports 1 global allocator per build artifact determined at compile time. This lacks the features of the Jai solution, which allow multiple distinct concurrent allocators running in the same application at the same time.

I believe Go has a memory management strategy with hidden allocations and a garbage collector. I've heard some people call Go a low level language, and I'm aware of an intense debate on the internet about Go's memory management strategy, even getting so intense as to call Go's marketers liars. But Go drama aside, Go's memory management strategy is not acceptable for Zig, so Go's unicode strategy is not very helpful in designing a unicode strategy for Zig.

ddevault commented 7 years ago

I wouldn't mind a global allocator. Perhaps you could use an allocator keyword to set a new allocator for a given scope? Again, though, many many sane Unicode string handling functions don't need allocators. And for that matter non-sane string splitting probably needs allocation as well. I don't really think it's relevant to this issue.

thejoshwolfe commented 7 years ago

I wouldn't mind a global allocator. Perhaps you could use an allocator keyword to set a new allocator for a given scope?

That sounds like Rust and Jai respectively.

The allocation discussion is a bit off topic, but it is relevant to keep in mind. Let's get back to string support and discuss some string functions/methods we might want to have.

I've gone through the list of Java 7's String methods and pasted in some highlights to discuss. The code examples are Java's API, not a proposed signature for Zig, although I'd like to discuss what Zig's version of each feature would be.

Additionally, both Rust and Java 7 seem to have methods related to interpreting UTF-8 or UTF-16 bytes as sequences of variable-length codepoints, but that might not be necessary depending on the string implementation. It does raise a question though, which is how should unicode strings really be implemented?

I propose adding a rune type, which holds one unicode codepoint. The underlying storage mechanism isn't relevant to the programmer, who can only assume it's an int capable of holding a unicode codepoint.

To me, this just means it's a u32. The range of possible unicode codepoints isn't a mystery. It's 0 through 0x10FFFF, which is too big for a u16 but small enough for a u32.

On platforms whose pointers are sufficiently sized, it should probably be a usize under the covers.

Wouldn't this be way too big on 64-bit platforms?

The underlying string encoding is also not important to the programmer, but some possible strategies include always using UTF-8 or UTF-32, or upgrading the encoding as necessary to fit the runes the user attempts to place in it.

I've seen these strategies done before, and they've all got their strengths and weaknesses. I've come to the conclusion that there is no such thing as a single best implementation of a unicode string, but rather countless subtle optimizations you can make to suit your different usecases. (This makes string implementations very similar to memory allocators in that regard.)

Zig can provide a general-purpose string implementation, but I don't like the idea of the standard implementation getting any special treatment that a homemade implementation can't get. Picking the optimal string implementation for each use case is part of Zig's quest for optimality, and that's not possible if a single standard string implementation is the only option. This means that userland string solutions need to be first-class citizens.

ddevault commented 7 years ago

String(byte[] bytes, Charset charset) and byte[] getBytes(Charset charset): What charsets should be supported? Just UTF-8? Maybe also ISO-8859-1? Maybe also Windows-1252? Maybe "all of them"? Maybe that's configurable at compile time? Or maybe there could be a dynamic library that provides these? Memory allocation is relevant here.

I would make encoding a separate concern from the rest of the string impl and put it in its own module. Not that it answers any of your questions, just a comment I have.

int compareToIgnoreCase(String str), boolean equalsIgnoreCase(String anotherString), String toLowerCase(), and String toUpperCase(): This requires a table of unicode points with data about each character. This would be a significant feature to provide, and we may want to provide a standard solution to this. Memory allocation is relevant to the last two methods here.

String toLowerCase(Locale locale) and String toUpperCase(Locale locale): I didn't know that uppercasing and lowercasing were sensitive to locale. Should Zig worry about this?

Most languages choose to only handle upper- and lowercase for Latin characters, which is the only commonly used set of characters for which it really makes much linguistic sense. In a Unicode implementation you'll find that human languages are really resistant to being implemented in software, and the standard library will probably have to concede to handling only the common cases, leaving exhaustive implementations of this and that to third parties.

int hashCode(): Implementations would be easy, but deciding on an implementation might be hard.

Zig should probably standardize a hashing strategy for all things, not just strings.

Wouldn't this be way too big on 64-bit platforms?

You're right, it should just be a u32.

thejoshwolfe commented 7 years ago

Most languages choose to only handle upper- and lowercase for Latin characters, which is the only commonly used set of characters for which it really makes much linguistic sense. In a Unicode implementation you'll find that human languages are really resistant to being implemented in software, and the standard library will probably have to concede to handling only the common cases, leaving exhaustive implementations of this and that to third parties.

Here's an example corroborating your point. In JavaScript "ΣΣ".toLowerCase() == "σς". The same uppercase sigma lowers into two different lowercase sigmas, because there's a special character for a lowercase sigma at the end of a word.

This kinda makes me want to not even bother with uppercase/lowercase at all, not even for the ascii characters, just so no one is expecting things to work when they don't. Either that, or offer toUpperCase and toLowerCase just for u8s, and possibly even explicitly say it's just for ascii, like asciiToUpperCase(). This could be useful for hexadecimal representations, for example.
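
For what it's worth, std.ascii takes roughly that approach: its case mappings only touch the ASCII range and pass every other byte through unchanged. A small illustration (the test name is arbitrary):

const std = @import("std");

// ASCII-only case mapping: non-ASCII bytes pass through unchanged.
test "ascii-only upper casing" {
    var buf: [16]u8 = undefined;
    const upper = std.ascii.upperString(&buf, "0xdeadbeef");
    try std.testing.expectEqualStrings("0XDEADBEEF", upper);
}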

andrewrk commented 7 years ago

I propose adding a rune type [...] I also propose adding a str type

I'm not convinced that this is a language change rather than a standard library feature.

jmonasterio commented 6 years ago

If this isn't implemented right, there can be lots of future pain.

https://docs.python.org/release/3.2/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit

robinei commented 6 years ago

There are lots of misunderstandings about unicode codepoints. They are not "characters" in general; actual on-screen symbols/glyphs (grapheme clusters) are variable codepoint-length.

The following operations were mentioned:

When I refer to u8 sequences I mean utf8 encoded strings. I do see the value of a type that witnesses a valid utf8 encoded string, and which supports various codepoint decode utilities, and maybe grapheme cluster splitting etc.

ds2643 commented 4 years ago

This thread just gets better with time.

CantrellD commented 4 years ago

Can we get some clarification on how runes and strings could be implemented in the standard library, and how that will relate to language-level features like string literals?

The distinguishing property of a string, relative to a byte array, is that it always represents a valid sequence of Unicode code points, as interpreted using some (typically unspecified) text encoding. An actual string type allows you to express that with the static type system. If you use a byte array instead, then you need to validate the byte array at runtime before you can do anything with it. And then you need to validate it again, in the next function that does something with that byte array.

Obviously you wouldn't use an actual string type at an external boundary, e.g. the command line, where it would be invalid to assume properly encoded UTF-8 text. But a proper string type allows you to validate an untrusted byte array, and then (conditional on validation) use the new value (of type string) at any internal boundary where a trusted string is required.

So how would an actual string type be implemented? My best guess is that it would be (a pointer to?) an array of Runes, and that a Rune would be an opaque type that can only have values in the range [U+000000, U+10FFFF]. I think this could be enforced by e.g. exposing a function that accepts a 32 bit integer and returns either a Rune or an error, depending on the value of the integer.
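
A rough sketch of that shape in userland Zig; the names are invented, and the surrogate check reflects the reservation of [U+D800, U+DFFF] for UTF-16:

const std = @import("std");

// Sketch only: a code point type whose sole constructor validates its input.
const Rune = struct {
    value: u21,

    fn fromInt(x: u32) error{InvalidCodepoint}!Rune {
        if (x > 0x10FFFF) return error.InvalidCodepoint;
        if (x >= 0xD800 and x <= 0xDFFF) return error.InvalidCodepoint; // UTF-16 surrogates
        return .{ .value = @intCast(x) };
    }
};

test "validate code points" {
    try std.testing.expectError(error.InvalidCodepoint, Rune.fromInt(0x110000));
    _ = try Rune.fromInt('界');
}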

Is that the intended path forward? If so, will string literals represent bytes or Runes? Will hex escapes that don't map to a valid Rune (e.g. \xFF) be removed from the language?

jakwings commented 4 years ago

@CantrellD It's already possible to implement comptime initialization/validation of UTF-8 text, like:

fn u(comptime s: []const u8) Utf8String {
    return Utf8.stringFromUtf8(s) catch unreachable;
}

test "format and print" {
    // æ (U+00E6)
    // Utf8.print is an enhanced print
    try Utf8.print(stdout, "{}", .{"æ"}); // ok
    try Utf8.print(stdout, "{}", .{"\xC3\xA6"}); // ok
    try Utf8.print(stdout, "{}", .{"\xE6"}); // runtime error
    // new "z" specifier for arbitrary bytes (not NUL-terminated: "s")
    try Utf8.print(stdout, "{z}", .{"\xE6"}); // ok
    try Utf8.print(stdout, "{s}", .{"\xE6"}); // ok
    var s = "\xE6";
    try Utf8.print(stdout, "{}", .{s}); // runtime error
    try Utf8.print(stdout, "{z}", .{s}); // ok
    try Utf8.print(stdout, "{s}", .{s}); // ok

    var s1 = u("\xE6"); // comptime error
    var s2 = try Utf8.stringFromUtf8(s); // runtime error
    var s3 = Utf8.stringFromUtf8Unchecked(s); // risky
}
CantrellD commented 4 years ago

@iology Please excuse my ignorance, but are Utf8String and Utf8 already available in the standard library, or is that just an example? I tried to find them, but failed.

I ask in part because it isn't clear to me that you can instantiate Utf8String with a validated runtime value, which is an important use case.

jakwings commented 4 years ago

@CantrellD Yes, just example code. A lot can be learned from the Rust stdlib.

btw, correction to my example:

    // these are all runtime behaviors, unless you `comptime print(...)` or forbid "{}" for []u8 at comptime
    Utf8.print("{}", .{"æ"}); // ok
    Utf8.print("{}", .{"\xC3\xA6"}); // ok
    Utf8.print("{}", .{"\xE6"}); // error

I ask in part because it isn't clear to me that you can instantiate Utf8String with a validated runtime value, which is an important use case.

While how we get a validated runtime value is still unknown (what type would the value be?), maybe you need a new builtin function like @toUtf8StringUnchecked or, if possible, simply @bitCast(string, validated_but_structure_unknown).

Is that the intended path forward? If so, will string literals represent bytes or Runes? Will hex escapes that don't map to a valid Rune (e.g. \xFF) be removed from the language?

I think no, otherwise UTF-8-encoded raw identifiers would naturally be allowed and breaking changes between versions of the Unicode standard would not be a concern for Zig. (#3947) edit: I guess there is unlikely to be a full-featured UTF-8 module in the standard library.

Though not a big problem, restriction on \xHH will make it inconvenient for initializing byte strings or [ASCII-compatible encoding inserted here] strings.

CantrellD commented 4 years ago

While how we get a validated runtime value is still unknown (of what type the value is?), maybe you need a new builtin function

I'm not sure you do need a new builtin function, actually; I think it may be sufficient to define a library which exports (a) an opaque type called string, (b) a function that transforms untrusted byte arrays into strings (or else fails, if the byte array is invalid), and (c) a set of fundamental functions for string processing. You'd need to avoid instantiating invalid strings within that library, but outside the library I believe it would be impossible to do so.
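
A bare-bones sketch of (a) and (b), assuming UTF-8 as the internal encoding; note that Zig struct fields are not private, so the "only via the validator" guarantee is by convention rather than enforced:

const std = @import("std");

// Sketch: holding a String is evidence that the bytes passed UTF-8 validation.
const String = struct {
    bytes: []const u8,

    /// (b): turn an untrusted byte slice into a String, or fail.
    fn fromUtf8(untrusted: []const u8) error{InvalidUtf8}!String {
        if (!std.unicode.utf8ValidateSlice(untrusted)) return error.InvalidUtf8;
        return .{ .bytes = untrusted };
    }
};

test "validate then trust" {
    _ = try String.fromUtf8("æ"); // ok
    try std.testing.expectError(error.InvalidUtf8, String.fromUtf8("\xE6")); // truncated sequence
}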

That assumes that you want a string type, to enforce the weird rules that Unicode tries to create for what a valid sequence of codepoints should look like. Regardless, I believe you'd need an opaque Rune type, probably defined in roughly the same way I just described, to restrict the range of values that can exist for individual codepoints.

Though not a big problem, restriction on \xHH will make it inconvenient for initializing byte strings or [ASCII-compatible encoding inserted here] strings.

Given that strings and bytestrings are different things, I think it's more reasonable to have distinct syntax for representing bytestrings. The hex escapes aren't safe in normal string literals, but they'd be fine in bytestring literals.

As it is, you can create "string" literals that aren't actually valid strings. I'm not aware of any other language (aside from Python, and probably C) that allows that.

Edit: I've been using "rune" as a synonym for "codepoint" to align with the terminology in the original post, but I finally looked it up, and I think it might be an alias for u32 that golang invented. So, it's probably better if I stop using it. Apparently static type safety for unicode strings is less ubiquitous than I thought.

jakwings commented 4 years ago

to enforce the weird rules that Unicode tries to create for what a valid sequence of codepoints should look like

Any sequence of code points never generates invalid code units (u8/u16/u32), and all of that depends on the locales and fonts in use, so it's not a real issue for Zig as long as these functions are not required in the standard library.

I'm not sure you do need a new builtin function, actually; I think it may be sufficient to define a library which exports (a) an opaque type called string, (b) a function that transforms untrusted byte arrays into strings (or else fails, if the byte array is invalid), and (c) a set of fundamental functions for string processing.

This is how I view it: (a) plus (b) gives a "builtin" function, although not in the form @builtin. (a) plus (c) gives more builtins. This is because I assume that you also need indexing support s[index]. So it is an opaque type with indexing syntax support. (operator overloading is not supported at the moment, and maybe never will be) edit: excuse me, having special syntax is already builtin support; a library is unable to invent new syntax. My brain needs to take some cool drink.

Given that strings and bytestrings are different things, I think it's more reasonable to have distinct syntax for representing bytestrings.

I'm completely fine with myComptimeCStrGeneratorThroughDoubleEscape("\\xHH\\xHH\\xHH"). No special syntax.

All the discussion from people above boils down to these questions:

  1. How much unicode support do you need on the syntax level? (source file already requires UTF-8 encoding)

  2. Is efficient handling of unicode text impossible without compiler support? (simple jobs can be covered by std.mem through manipulating bytes, so only special treatment needs to go into the Unicode module)

  3. Where and how often do you really need it? Do you just need a standard implementation in the stdlib? (other than initializing string literals in a special syntax plus comptime validation, why should it work differently?)

There are already some good arguments above. (2) is the most interesting but I'm not an expert on compilers.

CantrellD commented 4 years ago

Any sequence of code points never generates invalid code units (u8/u16/u32), and all of that depends on the locales and fonts in use, so it's not a real issue for Zig as long as these functions are not required in the standard library.

By "weird rules" I meant e.g. codepoints in the range [U+D800, U+DFFF] being reserved exclusively for UTF-16 encoded text. I don't know if Zig will ever care about those rules; I was just noting that they exist.

This is because I assume that you also need indexing support s[index]. So it is an opaque type with indexing syntax support.

If indexing support is needed, then AFAICT it's not possible to implement a string type with static type safety (for UTF-8 validity) in the standard library. If that's the case, then why was this issue closed? Am I misunderstanding the original proposal?

I'm completely fine with myComptimeCStrGeneratorThroughDoubleEscape("\xHH\xHH\xHH"). No special syntax.

I didn't mean to suggest that bytestring literals are necessary, only that they're an option, if restricting string literals to valid UTF-8 is otherwise too inconvenient.

All the discussion from people above boils down to these questions

I'm not sure those cover the most central question of this issue, as I understand it:

Will Zig (or the standard library) expose a string type that allows you to prove (with the static type system) that a runtime string value has already been validated?

jakwings commented 4 years ago

By "weird rules" I meant e.g. codepoints in the range [U+D800, U+DFFF] being reserved exclusively for UTF-16 encoded text. I don't know if Zig will ever care about those rules; I was just noting that they exist.

Their literal form is already forbidden in source code, but the escaping (\u{D800}) is currently allowed for byte strings. They don't do any more harm than other code points except that they may be rejected by other libraries for the same reason.

Am I misunderstanding the original proposal?

I'm not sure if the OP had thought about your ideas. Isn't s[i] just a syntactic sugar for s.byteAt(i) or s.codePointAt(i) (may return error, may return a reference or a copy of the value)? My concern about builtins is only about the explicitness denoted by that @ symbol, please never mind.

Will Zig (or the standard library) expose a string type that allows you to prove (with the static type system) that a runtime string value has already been validated?

Not my expertise. Emphasis added for others. For some real code you can look at std.unicode.Utf8View.initComptime.

CantrellD commented 4 years ago

Their literal form is already forbidden in source code, but the escaping (\u{D800}) is currently allowed for byte strings.

...Huh. The status quo actually makes way more sense if I interpret everything as a bytestring, as opposed to a unicode string that Zig represents using bytes for some reason. It is of course reasonable to allow arbitrary bytes in bytestrings; I'm just slow.

I'm not sure if the OP had thought about your ideas.

Yeah, that's entirely possible. I don't think of them as my ideas, but I'm also not sure how prevalent they are outside of OOP. I shouldn't make assumptions.

Not my expertise. Emphasis added for others. For some real code you can look at std.unicode.Utf8View.initComptime.

Totally fair, and appreciated. Thank you.

tmccombs commented 3 years ago

Rust and Go come to mind as being fairly low level and having sane Unicode support.

Rust's Unicode support is almost entirely part of the standard library. As far as I know the only reason str needs to be a native type in Rust is because it is the type of string literals (as opposed to byte array literals), which are validated as UTF-8 at compile time. And Rust doesn't have a rune type either, it just uses u32 for codepoints (although a validating rune type could also be done in a standard library). Also, in Rust, indexing and slicing a string work in terms of bytes, not codepoints, although iterating over a string with .chars() does iterate over codepoints, not bytes.

Lokathor commented 3 years ago

Rust has a char type, which is 4 bytes but which is separate from plain u32 because it's got appropriate niches so Option<char> and char have the same size.

For a similar thing in Zig, to have ?rune and rune be the same size, it would take language support.

The rest can be in the standard library (or even a user library).

Sobeston commented 3 years ago

Rust has a char type, which is 4 bytes but which is separate from plain u32 because it's got appropriate niches so Option<char> and char have the same size.

For a similar thing in Zig, to have ?rune and rune be the same size, it would take language support.

The rest can be in the standard library (or even a user library).

Rune could be u21, leaving some bits free for the optional. So this wouldn't actually need any extra language support.

ikskuh commented 3 years ago

I think enum(u21) { _ } would be a better choice, as it's a non-arithmetic type. I also don't think that rune is a good type name, [as it's misleading](https://en.wikipedia.org/wiki/Runic_(Unicode_block)) with respect to Unicode, but it should just be called codepoint in that case (as that is the correct Unicode term).

But I'm not sure if all of that is worth the hassle.

CannibalVox commented 3 years ago

Seems strange to me that the documentation says that "strings are an array of bytes" but you can't actually find their length like you would an array.

ifreund commented 3 years ago

Seems strange to me that the documentation says that "strings are an array of bytes" but you can't actually find their length like you would an array.

const std = @import("std");
test {
    try std.testing.expectEqual(@as(usize, 3), "foo".len);
}
CannibalVox commented 3 years ago

Seems strange to me that the documentation says that "strings are an array of bytes" but you can't actually find their length like you would an array.

const std = @import("std");
test {
    try std.testing.expectEqual(@as(usize, 3), "foo".len);
}
test {
    try std.testing.expectEqual(@as(usize, 1), "🔥".len);
}
nektro commented 3 years ago

"🔥".len is actually equal to 4. It is codepoint U+1F525 and represented as F0 9F 94 A5.

CannibalVox commented 3 years ago

Correct, meaning that "🔥".len does not tell you the length of the string "🔥", which is 1. It tells you the length of the byte slurry that represents "🔥".

nektro commented 3 years ago

That is not relevant in 99% of use cases, as described above. And in the few cases where it does matter, the greater package ecosystem can fill in.

Lokathor commented 3 years ago

Byte count, code point count, and grapheme cluster count are all valid things to want to know.

It just needs a little more documentation that .len is "byte length", not any other measure.
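
The first two of those counts can already be computed with today's std; a small illustration (grapheme cluster counting would need Unicode segmentation data, which as far as I know std does not ship):

const std = @import("std");

test "byte length vs code point count" {
    try std.testing.expectEqual(@as(usize, 4), "🔥".len); // bytes
    try std.testing.expectEqual(@as(usize, 1), try std.unicode.utf8CountCodepoints("🔥")); // code points
}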

CannibalVox commented 3 years ago

Languages written with code points of more than one byte account for the majority of text read and written on planet Earth; this isn't an edge case, it's the default case.

I disagree that this is a documentation issue, although that is a possible route to a resolution. Most slice functionality does not work for strings, because strings in zig are not represented as a slice of characters, but an opaque byte-serialized value.

tmccombs commented 3 years ago

strings in zig are not represented as a slice of characters

How do you define character? What do you expect the length of ñ to be? I don't know of any language that will give you an answer of 1. Languages that use an array of codepoints (Go, Python) or UTF-16/UCS-2 (Java, JavaScript) will give you 2; languages that treat strings as UTF-8-encoded byte arrays (Zig, Rust, C) will give you 3.

jakwings commented 3 years ago

Correct, meaning that "🔥".len does not tell you the length of the string "🔥"

Then what about "🇺🇸" if emoji is the most concerning use case?

Most slice functionality does not work for strings, because strings in zig are not represented as a slice of characters, but an opaque byte-serialized value.

Normal user-facing problems are mostly about proportional text (variable width), fonts, alignment, line wrapping, etc., which cannot be handled by pure character counting.

CannibalVox commented 3 years ago

Then what about "🇺🇸" if emoji is the most concerning use case?

If you are a human communicating in text on the internet, then odds are very good you use codepoints of more than one byte. Perhaps you use emoji. Perhaps you use mandarin, or hindi, or russian, or perhaps you speak english and use a text editor that creates curled quotation marks.

This is baseline knowledge for having a conversation about unicode.

How do you define character? What do you expect the length of ñ to be? I don't know of any language that will give you an answer of 1. Languages that use an array of codepoints (Go, Python) or UTF-16/UCS-2 (Java, JavaScript) will give you 2; languages that treat strings as UTF-8-encoded byte arrays (Zig, Rust, C) will give you 3.

I'd love for zig to reason about graphemes somehow but codepoints are table stakes and I'll show you why:

    const stdout = std.io.getStdOut().writer();

    try stdout.print("{s}{s}{s}", .{"\n\n\n", "長"[0..1], "\n"});

Output:

Test [2/2] test "text too long"...

Θ
All 2 tests passed.

Also, while both Rust & Go use bytes by default, they both have a cheap utility to get codepoints:

    fmt.Println(len("長"))
    fmt.Println(len([]rune("長"))) // This is actually optimized away at compile time

Rust has str.chars().

Which basically leaves Zig in the company of a language that predates the Unicode Consortium.

To me, making u21 a first-class citizen for string representation (that is, at the very least: a straightforward mechanism for u21 literals, {s} support for u21 slices, and full-slice encode & decode in the stdlib) is a basic language feature.

jakwings commented 3 years ago

@CannibalVox There is already some basic support at lib/std/unicode.zig.

This is baseline knowledge for having a conversation about unicode.

You can be more explicit about what kind of support you want instead of throwing out random examples. Many issues have been covered by previous comments. How can we expect the compiler to give the correct answer if we can't answer those questions? Would it be misleading if "🇺🇸".len gave us 2 (its code point count)?

    fmt.Println(len("長"))
    fmt.Println(len([]rune("長"))) // This is actually optimized away at compile time

Please see lib/std/unicode.zig.
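
For example, iterating code points with the existing std.unicode API looks roughly like this:

const std = @import("std");

test "iterate code points" {
    var it = (try std.unicode.Utf8View.init("長い")).iterator();
    while (it.nextCodepoint()) |cp| {
        std.debug.print("U+{X:0>4} ", .{cp}); // prints U+9577 U+3044
    }
}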

jecolon commented 3 years ago

I humbly suggest taking a look at Ziglyph and Zigstr, which basically provide all the functionality scattered throughout this thread in the form of userland libraries. From bytes, to code points, to grapheme clusters, to words, to sentences, to collation, and normalization, it's all there.

Now, aside from the shameless plug 😄, if there's something I've learned from these past months immersed in the Unicode Standard, it's that setting up your Unicode text processing foundation based on the Code Point is a big mistake. Unicode "characters" are abstract concepts, and what humans perceive as "characters" may consist of more than one of those Unicode characters, which in turn may consist of one or more code points, which in turn may consist of one or more code units (which in the case of UTF-8 are bytes). So if I were to choose a basic "character" data structure, I would model it to represent Grapheme Clusters, which are indeed what most humans perceive as characters. With grapheme clusters, "🔥".len is 1, no matter how many bytes or how many code points, which is what a normal human reader would expect.

Rust, Go, and any other language that provide indexing syntactic sugar into strings that produce individual code points are just plain wrong. A code point may be a character, as is the case with the ASCII subset in UTF-8, but when you have clusters that can be composed of up to 65 code points, returning just one of them is useless, pointless, and downright misleading.

PS: As a matter of fact, in Zigstr there is no "🔥".len , but rather there are zigstr.byteCount() , zigstr.codePointCount(), and zigstr.graphemeCount() methods precisely to emphasize and be clear about the differences between these concepts.

tmccombs commented 3 years ago

Rust, Go, and any other language that provide indexing syntactic sugar into strings that produce individual code points are just plain wrong

Rust doesn't have such syntactic sugar. It has syntactic sugar for getting a substring using byte indices, but not individual code points (or bytes) although there are functions that allow you to do so.

In go, indexing a string gives you a byte (strings are utf8 encoded), not a code point.

A language that does match your description is python, where strings are treated as arrays of code points.

Java and JavaScript are worse. Indexing returns a utf-16 code unit, which is almost never what you want.

jecolon commented 3 years ago

@tmccombs that's right, I stand corrected. I was thinking about the chars() method in Rust and the range loop on string in Go, which may mislead the user into thinking they're getting characters when they're actually getting code points. The Python, Java, and Javascript cases are definitely worse, I agree.

Which makes one wonder: All these languages are developed by really smart people, and yet they all seem to miss the mark when it comes to implementing the Unicode standard, specifically handling strings and attempting to define what characters are. It could be the result of half-hearted attempts, tackled reluctantly by skimming the standard or just copying other implementations. But I think the real culprit is the Unicode standard itself, being so voluminous and complex. Then again, human languages are indeed complex, so a simplified standard is probably an impossible dream.

Lokathor commented 3 years ago

The rust char type holds a Unicode Scalar Value, which is technically distinct from a Unicode Code Point because it's illegal to store certain values in a char. The advantage of this nonsense is that it takes a fixed amount of memory to store many things you'd want text for. Not all, but quite a bit. Quite a bit more than you can fit in a u8.

Basically you sometimes don't want to have dynamically sized values for everything.

In practice most people don't use char in most of their Rust code. Just using String and &str is usually enough.

jecolon commented 3 years ago

I remember a while back reading (buried deep in some forum somewhere) a comment that stated something like "only Swift has gotten it right." I decided to investigate and indeed, in my opinion, Swift has the most faithful and robust Unicode string and character implementation. Characters are extended grapheme clusters, and indexing and length are what you'd expect. They have functions to access different views of a string, like its Unicode scalars and code units in UTF-16 or UTF-8. Additionally, the indexes that are returned from functions like firstIndex can be used among the different views. Even string equality comparison using the == operator uses normalization to correctly compare Unicode strings regardless of combining marks. Impressive work! https://developer.apple.com/documentation/swift/string