wren-lang / wren

The Wren Programming Language. Wren is a small, fast, class-based concurrent scripting language.
http://wren.io
MIT License

[RFC] Add character literals #1059

Open PureFox48 opened 2 years ago

PureFox48 commented 2 years ago

One thing I've found myself missing lately is character literals. My code seems to have a lot of magic numbers such as 48, 65, 90, 97 and 122, which is OK when you write it but not so good when you come back to it weeks or months later or when other people are trying to read your code.

When programming in other C-family languages I can generally replace these numbers with the character literals '0', 'A', 'Z', 'a' and 'z' respectively which are self-documenting. Now I know you can do stuff such as "A".bytes[0] to get 65 but it's hardly nice or efficient as you're having to create a String object and call a method on it just to get a number!
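
For reference, a minimal sketch of what this looks like in current Wren. The `isUpper` helper is hypothetical and exists only for illustration; `bytes` and `codePoints` are the existing String accessors.

```wren
// Hypothetical helper in the "magic number" style: 65 is 'A', 90 is 'Z'.
var isUpper = Fn.new { |codePoint| codePoint >= 65 && codePoint <= 90 }

// The existing workaround: build a String, then index into its bytes
// (or code points) just to recover a number.
var upperA = "A".bytes[0]

System.print(upperA)               // 65
System.print("A".codePoints[0])    // 65
System.print(isUpper.call(upperA)) // true
```

With the proposed literals, the two constants in `isUpper` could be spelled as the characters they stand for.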

I'm not suggesting here that we need a new Char type - we don't - just a simple alias for integer numbers in the code-point range 0 to 0x10ffff. So 'A' and 65 would mean exactly the same thing.

We don't currently use single quotes for anything else (and are unlikely to use them in the future) so there are no compatibility issues here.

A slight difficulty is escape sequences. Clearly we'd need a new one '\'' to represent the single quote character itself as a character literal. However, it would be unnecessary to escape double quotes and the percent sign (no interpolation of course) in character literals so '"' and '%' could be valid.

The real question is whether "\'" should be valid in string literals, and whether '\"' and '\%' should be valid in character literals as alternatives to their un-escaped versions. My personal opinion is that they should be, but I don't have a strong view either way.

So what do y'all think - a worthwhile addition to the language or not?

mhermier commented 2 years ago

@munificent clearly stated that he was against such an addition. That said, particularly when you are manipulating or parsing strings, the lack of easy conversion between char, string and number is really annoying. So if C-style char encoding is adopted, I would go all the way with a Unicode Char type.

But we can already simulate it with an external library and write something like:

Char.fromCodePoint(Char.fromString("a").codePoint + foo)

With trivial use of static maps, it can be quite efficient in terms of speed and memory (by making Chars immutable).
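
A rough sketch of how such a library class might look, written in plain Wren. Everything here is an assumption made for illustration: the class is not part of Wren's core, and the `__cache` map and the method bodies are one possible reading of the "static maps" idea, not the actual external library.

```wren
// Hypothetical library class; not part of Wren's core.
class Char {
  // Intended to be reached only through the static factories below.
  construct new(codePoint) {
    _codePoint = codePoint
    // Cache the String form once; Chars are never mutated afterwards.
    _string = String.fromCodePoint(codePoint)
  }

  // Returns the single shared Char for a code point, creating it on first use.
  static fromCodePoint(codePoint) {
    if (__cache == null) __cache = {}
    var c = __cache[codePoint]
    if (c == null) {
      c = Char.new(codePoint)
      __cache[codePoint] = c
    }
    return c
  }

  // Uses the first code point of the given string.
  static fromString(s) { fromCodePoint(s.codePoints[0]) }

  codePoint { _codePoint }
  toString { _string }
}

var foo = 2
System.print(Char.fromCodePoint(Char.fromString("a").codePoint + foo)) // c
```

Because each code point maps to a single shared instance, repeated conversions reuse the cached String instead of allocating a new one each time.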

PureFox48 commented 2 years ago

I wondered whether @munificent had ever said anything about this but couldn't find anything in the back issues.

In my Wren-str module, I actually have a Char class which tests whether a single character string (or the first character of a longer string) falls into a particular category and contains routines for converting between characters and integers. So instead of "A".codePoints[0] I can do Char.code("A") but, as it's written in Wren, it's just as inefficient and a poor substitute for character literals IMO.

If we had a Char class in the standard library with a bunch of static methods implemented in C, that would be a better, albeit heavier, solution.

mhermier commented 2 years ago

Well, my personal taste would be to have a native type for Char. It would solve a bunch of problems, but I wonder how costly it could be in the VM to have this extra native type.

PureFox48 commented 2 years ago

A lot of stuff would need to be altered if we were to add a new native type but the methods themselves should be simple to implement.

Do you remember what @munificent's objection was to character literals?

mhermier commented 2 years ago

From what I remember, it was that String was enough to handle any character needs.

But from my point of view the argument was limited by string subscript access, and the multiple allocations this decision implies...

PureFox48 commented 2 years ago

Well, I think subsequent experience has shown that we need a better bridge between strings and the characters they contain than we have at present.

Another awkward situation we have just now is converting from a list of bytes or code-points to a string which we looked at in some depth in #916.

ruby0x1 commented 2 years ago

My code seems to have a lot of magic numbers such as 48, 65, 90, 97 and 122 which is OK when you write it but not so good when you come back to it weeks or months later or when other people are trying to read your code.

I don't usually find this bit in particular to be a problem because of "enum" classes.

//ascii decimal values for required characters
class Char {
  static Zero { 48 }
  static A { 65 }
  static Z { 90 }
  ...
}
...

... Char.A, Char.Z ...

mhermier commented 2 years ago

While I agree that using an enum is a valid solution, it loses a little information compared with a Char type. Personally I prefer to have it wrapped in a thin, immutable Char type and call codePoint on it to retrieve its numerical value. Because we can make it immutable, we benefit a lot in String conversions, at minimum by caching the String representation, and we avoid multiple small string reallocations by taking advantage of String being immutable.

aosenkidu commented 1 year ago

Well, my personal taste would be to have a native type for Char. It would solve a bunch of problems, but I wonder how costly it could be in the VM to have this extra native type.

It probably could be a new NAN-tagged type? But I understand NAN-tagging is optional, right?

mhermier commented 1 year ago

There are two value representations in Wren. So yes, it would have to be implemented for both NaN tagging and union tagging. Union tagging is simpler, so the issue is more about finding room in the NaN-tagged representation.

HallofFamer commented 1 year ago

I’ve added 32-bit int literals to my implementation of the Lox language, whose VM is quite similar to Wren's. I’d say that adding a new native atomic type under NaN tagging shouldn’t be an issue at all, and I haven’t noticed any performance problem either. In theory a total of 8 native atomic types can be implemented, and Wren has 5 right now (NAN, NULL, TRUE, FALSE and UNDEFINED).

A better question is whether this is actually needed. To my understanding, most dynamically typed languages do not have character literals, and for a good reason. It depends on how Wren evolves in the future. If there’s a plan to make it gradually typed or even statically typed in version 1.0, then it makes sense to add a dedicated Char type. If it stays a simple dynamically typed language, then it may not really be necessary.

PureFox48 commented 1 year ago

I agree that it's not necessary to have a separate 'Char' type in Wren and all I'm asking for here is character literals as aliases for the corresponding code-point numbers so you don't have to remember what these are.

Having said that, I know that @mhermier would like a native 'Char' type for technical reasons and is incorporating it into his 'veery' language - see #1159.

aosenkidu commented 1 year ago

Smalltalk has strings delimited on one side only, like this:

#iamaSingleWordLitteral, which is equivalent to 'iamaSingleWordLitteral' (Smalltalk uses single quotes for strings). So you could type #A as a synonym for "A". Not the same thing as a char type, though.

mhermier commented 1 year ago

While that feature would be cool, unfortunately '#' is used for shebang and attributes. If I remember well, that feature was also discussed somewhere else...

HallofFamer commented 1 year ago

Well, # in Smalltalk denotes a Symbol, which is basically an immutable, interned string. It’s worth noting that strings are mutable in Smalltalk, which seems to have been the norm in that era. It also has character literals, written with $.