wren-lang / wren

The Wren Programming Language. Wren is a small, fast, class-based concurrent scripting language.
http://wren.io
MIT License

UTF8 support? #68

Closed pwoolcoc closed 9 years ago

pwoolcoc commented 9 years ago

One of my biggest pet peeves with Lua is the lack of native Unicode support. In my opinion, there is no reason for a new programming language to still be stuck in ASCII-land. I would be willing to investigate different ways of making Wren strings UTF-8 by default, instead of ASCII, if it is something you think would be worth putting the time into.

ryanplusplus commented 9 years ago

FYI, Lua 5.3 (currently in RC) has "basic" support for UTF-8: http://www.lua.org/work/doc/manual.html

munificent commented 9 years ago

Yes! My intent has always been for Wren to be based on UTF-8. I've taken a few baby steps there, but there's still a lot of work to do. I think it handles Wren source files being UTF-8 correctly, but the various string operations at runtime still need work.

The main tricky bit is, of course, handling indexing into a string when it uses UTF-8. We'll have to figure out what behavior we think users will want and how efficiently it can be implemented.

pwoolcoc commented 9 years ago

How averse are you to dependencies? ICU is way too much, but there are smaller projects (https://github.com/josephg/librope, for example) that we might be able to leverage for help with string operations.

edsrzf commented 9 years ago

The lexer doesn't currently allow UTF-8 identifiers, but that may be by design; it's unclear.

munificent commented 9 years ago

> How averse are you to dependencies?

For better or worse, very averse. I'm not opposed to them in general. I love code reuse. But Wren's charter is to be as minimal, lightweight, and easy to drop into a codebase as possible. Part of that means minimal dependencies.

ICU is probably 100x bigger than all of Wren. :)

I'm not planning to have the core library support any complex Unicode functionality (collation, etc.). For that stuff, users are better off bringing that functionality in themselves. My goal is just to make sure Wren's internal string operations can handle storing Unicode text, and that the operations that are provided don't do something dumb on non-ASCII strings.

I don't think an optimized rope implementation is needed either (though I do think ropes are super cool). Users could always roll their own in Wren if needed.

munificent commented 9 years ago

> The lexer doesn't currently allow UTF-8 identifiers, but that may be by design; it's unclear.

By design. From what I've heard from Java and JS folks, Unicode identifiers ended up being a security and maintainability headache. For what it's worth, Ruby only allows ASCII letters in identifiers and Matz and company are Japanese.

Lua apparently uses the current locale to decide which identifiers are valid, which I think means code may not be portable across machines. :open_mouth:

munificent commented 9 years ago

There's still work left to do, but I made a bunch of progress on this:

The missing pieces I know of are:

In general, users should be dissuaded from thinking about a string's "length". It probably doesn't mean anything practically useful most of the time. Instead, they should use the higher-level methods on String whenever possible (iterating, indexOf, startsWith, etc.).

kmarekspartz commented 9 years ago

This has a size cost, but couldn't a string hold an array of pointers to code points? Then countCodePoints is just the size of that array, and subscripting can be done by code point.

It may be useful to split String into ByteString and String classes...

MarcoLizza commented 9 years ago

Could the count getter simply be split into size (in bytes) and length (in code-points)?

Getting the length and/or size of the strings is crucial in I/O (over network sockets, for example).

pwoolcoc commented 9 years ago

ah, @munificent, you beat me to it! I have a branch almost done with almost everything you just committed. If you haven't started the Range subscripting yet, I volunteer to take that on.

kmarekspartz commented 9 years ago

> Could the count getter simply be split into size (in bytes) and length (in code-points)?

Those names might be ambiguous.

MarcoLizza commented 9 years ago

> Those names might be ambiguous.

I agree that, perhaps, count may be a better choice over length (also for consistency). But size sounds quite unambiguous, to me.

munificent commented 9 years ago

> This is a size issue, but couldn't a string hold an array of pointers to code points? Then the countCodePoints is the size of that array, and subscripting can be done by code point.

Pointers are (at least) 32 bits, so you'd be better off just using UTF-32 at that point. The problem is that that's a horrifically inefficient encoding for most real-world strings. A very large fraction of strings in programs are ASCII. This is true even in programs written for users of other languages, since many strings contain things like IDs and other "internal" stuff that are never shown to humans. Outside of that, the vast majority of strings fit in UTF-16. Until someone starts writing Wren programs that deal with Linear B or hieroglyphics, we'd never need more than 16 bits per character. So allocating 32 bits all the time is just super painful. It wastes memory and it slows things down because it increases cache misses.

UTF-8 is, I think, the best compromise. It's optimally small and fast for ASCII strings, quite small and fast for strings in modern languages, and doesn't fall over under the full weight of Unicode.

The only thing you lose is direct indexing, but I think in practice that doesn't hurt much. For what it's worth, the approach I took here is exactly what Go does, and those guys have thought a lot about this (including being the ones to invent UTF-8 many moons ago).

> Could the count getter simply be split into size (in bytes) and length (in code-points)?

That was my first thought. I too think size pretty naturally sounds like "in bytes". But I think this is too likely to confuse people. I want the names to be really unambiguous.

Also, by making the names longer and a bit more awkward to use, it discourages people from thinking about the length of their string, which is good. It's rare that solving a problem with strings should require thinking about their length.

> ah, @munificent, you beat me to it! I have a branch almost done with almost everything you just committed.

Oh, crap! I'm so sorry. I was in the shower thinking about how I wanted to handle UTF-8 and I felt like everything clicked so I wanted to get it all implemented before I forgot. Now that I have more contributors (woo!), I need to think about coordinating more!

> If you haven't started the Range subscripting yet, I volunteer to take that on.

Yes please!

Speaking of Go, Go also allows strings to be used as arbitrary byte buffers. That means they can contain any byte value, including zero and malformed UTF-8 sequences. That seems pretty useful to me. Now that strings internally store their length (thanks, @edsrzf!) we don't need them to be null-terminated. Something to consider.

edsrzf commented 9 years ago

I considered removing the null terminator from strings, but decided against it since keeping it makes C interoperability easier.

munificent commented 9 years ago

One step closer! String subscripting with ranges works again and is UTF-8 savvy. (To slice a range of raw bytes from a string, we'll add range support to the subscript operator on string.bytes.)

The last piece is fixing the length/count getters.

munificent commented 9 years ago

OK, I think I have the count methods and the overall API figured out. See: fe143644b3f96c5e40b230eb4741732c55b63e45