wren-lang / wren

The Wren Programming Language. Wren is a small, fast, class-based concurrent scripting language.
http://wren.io
MIT License

[RFC] Introduce String.fromBytes and String.fromCodePoints methods. #916

Open PureFox48 opened 3 years ago

PureFox48 commented 3 years ago

One thing I find rather awkward in Wren is manipulating strings as byte buffers. Although converting a string to a mutable byte list is easy enough, converting it back again after manipulation is not so nice, as the following simple case conversion shows:

var s = "wren"
var bytes = s.bytes.toList
var S = bytes.map { |b| String.fromByte(b - 32) }.join()
System.print(S) // WREN

On the face of it, you have to convert each byte to a string and then concatenate them all together, which is hardly efficient.

I'm not sure how strings are implemented in Wren - length prefixed byte buffers? - but I'd have thought there must be a much more efficient way of doing this from the C side of things.

I realize of course that a String.fromBytes method would imply an ASCII encoding (and String.fromCodePoints a UTF-8 encoding) but I make no apologies for that as these are by far the commonest string encodings in current use.

Is there perhaps some technical difficulty here? I've always found it strange that Wren doesn't have String.upper and String.lower methods, which even languages with minimal standard libraries usually have. Although these are easy enough to write in Wren itself, they're inefficient and very slow for large strings.

I'd appreciate any comments.

mhermier commented 3 years ago

While I agree this is sometimes awkward, nothing prevents you from having a separate ByteList class with a fromString constructor and a toString method to achieve the same functionality.

Having it in the core means that the base language grows, which is not usually wanted.

String is implemented as a read-only structure because of internal trade-offs, mainly to implement a fast hash map.

toLower and toUpper do not exist because nobody has felt the need yet, but the request seems legitimate.

mhermier commented 3 years ago

From a technical point we can make a mitigated solution, by allowing a List of Num to be converted to a String. This would go like this:

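class String {
   // Public entry point: accepts any sequence and converts it to a List first.
   static fromBytes(bytes) { fromBytes_(bytes.toList) }
   // Implemented on the C side: validates the list and builds the string.
   foreign static fromBytes_(bytes)
}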

fromBytes_ would check that the given argument is a List containing only integer numbers compatible with bytes, and construct a string from it.
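The check described above can be sketched in plain C, independently of the VM. This is only an illustration under stated assumptions, not Wren's actual implementation: the name `bytes_from_nums` is hypothetical, and the list is modelled as a plain array of `double` (Wren's Num type is a double).

```c
#include <stddef.h>
#include <stdlib.h>

// Hypothetical helper, not part of the Wren VM: builds a NUL-terminated
// buffer from `count` numbers, each of which must be an integer in 0..255.
// Returns NULL if any element is out of range or not an integer (the point
// where a real implementation would raise a runtime error), or if the
// allocation fails. The caller frees the result.
char* bytes_from_nums(const double* nums, size_t count)
{
  char* out = malloc(count + 1);
  if (out == NULL) return NULL;

  for (size_t i = 0; i < count; i++)
  {
    double n = nums[i];
    // The range test also rejects NaN; the cast is only evaluated for
    // values already known to lie in 0..255, so it cannot overflow.
    if (!(n >= 0.0 && n <= 255.0) || (double)(int)n != n)
    {
      free(out);
      return NULL;
    }
    out[i] = (char)(unsigned char)n;
  }

  out[count] = '\0';
  return out;
}
```

Because the output size equals the list length, a single allocation and a single pass are enough here.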

mhermier commented 3 years ago

About toLower/toUpper, the problem might reside in the fact that we don't use a UTF-8 library to provide a valid conversion. At best we can do a cheap 7-bit ASCII version of it, which might not be enough for some people.

PureFox48 commented 3 years ago

> From a technical point we can make a mitigated solution, by allowing a List of Num to be converted to a String.

Yes, as long as the fromBytes_ private method is implemented in C, I'd be happy with that.

> About toLower/toUpper, the problem might reside in the fact that we don't use a UTF-8 library to provide a valid conversion.

I agree that's a difficulty. In my own implementations of these methods, I go a bit further than 7-bit ASCII by dealing with code points up to 255 which, apart from a few rare characters, is enough to deal with all the major Western European languages. Much improved coverage and still quite cheap to implement.
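A minimal sketch of what upper-casing limited to code points 0..255 could look like in C. The function name `upper_latin1` is hypothetical and this is not the implementation referred to in the comment above; it only shows why no lookup table is needed for this range.

```c
#include <stdint.h>

// Hypothetical helper: upper-cases a single code point, handling only
// code points up to 255 (ASCII plus the Latin-1 Supplement block).
// Anything else is returned unchanged.
uint32_t upper_latin1(uint32_t cp)
{
  // ASCII a-z map to A-Z.
  if (cp >= 'a' && cp <= 'z') return cp - 32;

  // U+00E0..U+00FE map to U+00C0..U+00DE, except U+00F7 (division sign).
  if (cp >= 0xE0 && cp <= 0xFE && cp != 0xF7) return cp - 32;

  // Left unchanged: U+00DF (sharp s), U+00B5 (micro sign) and U+00FF
  // (y with diaeresis), whose upper-case forms are either more than one
  // character or lie above code point 255.
  return cp;
}
```

The three code points left unchanged are the kind of rare exception that a version restricted to this range cannot map correctly.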

mhermier commented 3 years ago

If we go that route we might also want to have a fromCodePoints.

PureFox48 commented 3 years ago

Yes, we'd ideally need both to maintain symmetry with the existing String.fromByte and String.fromCodePoint methods.

However, even without knowing much about the VM internals, I can see that the latter would be trickier to implement.

For the 'bytes' version, you'd already know the size of the string buffer needed from the size of the list and presumably you could just 'block copy' the byte values into that.

For the 'codePoints' version, you'd need to iterate through the list (whose elements could now go up to 0x10ffff rather than just 0xff) to figure out how many bytes would be needed to represent that element in UTF-8 and then allocate a string buffer big enough to accommodate the total. You'd then need to iterate through the list again and embed the bytes for each element, one by one, into the buffer. There may be ways to optimize this process but it's clearly going to be much slower than the 'bytes' version.

If the list contained any 'out of range' values, that would have to be a runtime error as it is now for the existing single byte/code-point conversion methods.
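The two-pass approach described above can be sketched in standalone C as follows. The names are hypothetical and nothing here comes from the Wren VM; it assumes the same upper limit of 0x10ffff and signals an out-of-range value by returning NULL, where a real implementation would raise a runtime error. Surrogate code points get no special treatment in this sketch.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

// Number of UTF-8 bytes needed for `cp`, or 0 if it is above 0x10ffff.
static size_t utf8_length(uint32_t cp)
{
  if (cp <= 0x7f) return 1;
  if (cp <= 0x7ff) return 2;
  if (cp <= 0xffff) return 3;
  if (cp <= 0x10ffff) return 4;
  return 0;
}

// Writes the UTF-8 encoding of a valid `cp` at `out`; returns bytes written.
static size_t utf8_write(uint32_t cp, uint8_t* out)
{
  size_t n = utf8_length(cp);
  switch (n)
  {
    case 1:
      out[0] = (uint8_t)cp;
      break;
    case 2:
      out[0] = (uint8_t)(0xc0 | (cp >> 6));
      out[1] = (uint8_t)(0x80 | (cp & 0x3f));
      break;
    case 3:
      out[0] = (uint8_t)(0xe0 | (cp >> 12));
      out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3f));
      out[2] = (uint8_t)(0x80 | (cp & 0x3f));
      break;
    case 4:
      out[0] = (uint8_t)(0xf0 | (cp >> 18));
      out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3f));
      out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3f));
      out[3] = (uint8_t)(0x80 | (cp & 0x3f));
      break;
  }
  return n;
}

// Hypothetical helper: encodes `count` code points into a newly allocated,
// NUL-terminated buffer and stores its byte length in `*length`.
// Returns NULL on an out-of-range code point or allocation failure.
char* string_from_code_points(const uint32_t* cps, size_t count, size_t* length)
{
  // Pass 1: validate every element and add up the bytes needed.
  size_t total = 0;
  for (size_t i = 0; i < count; i++)
  {
    size_t n = utf8_length(cps[i]);
    if (n == 0) return NULL;
    total += n;
  }

  uint8_t* out = malloc(total + 1);
  if (out == NULL) return NULL;

  // Pass 2: encode each code point into the buffer.
  size_t pos = 0;
  for (size_t i = 0; i < count; i++) pos += utf8_write(cps[i], out + pos);

  out[pos] = '\0';
  if (length != NULL) *length = total;
  return (char*)out;
}
```

Validating in the first pass means nothing is allocated before an error is detected, at the cost of walking the list twice.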

PureFox48 commented 3 years ago

Incidentally, if we had a String.fromCodePoints method, then it might be quick enough to implement String.lower and String.upper methods in Wren itself.

What do you think about just going up to code-point 255 for these which wouldn't require a table or some external Unicode resource?

mhermier commented 3 years ago

I'm thinking things can be a little bit more complicated than that, and maybe we should be only utf-8 encoding compatible and not compliant. This would remove a lot of headaches.

PureFox48 commented 3 years ago

> maybe we should be only utf-8 encoding compatible and not compliant

Can you clarify, as I'm not sure what you mean by that.

Do you perhaps mean that out of range values could be dealt with by masking rather than throwing runtime errors?

mhermier commented 3 years ago

I mean allowing all the possible code points that the encoding allows: no banning of some ranges because of UTF-16, and encodings up to 8 (or so) bytes long.
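One possible reading of "compatible but not compliant" is the original UTF-8 scheme from before it was restricted to 0x10ffff, which used up to 6 bytes and reached 0x7fffffff. The sketch below encodes that way; it is an assumption about what is meant, the function name is hypothetical, and it stops at 6 bytes because anything longer would need a convention that is not described in this thread.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper: encodes `cp` with the historical 1..6 byte UTF-8
// scheme, with no range excluded below 0x80000000. `out` must have room
// for 6 bytes. Returns the number of bytes written, or 0 if `cp` does not
// fit in 31 bits.
size_t utf8_write_extended(uint32_t cp, uint8_t* out)
{
  size_t n;
  if (cp <= 0x7f) { out[0] = (uint8_t)cp; return 1; }
  else if (cp <= 0x7ff) n = 2;
  else if (cp <= 0xffff) n = 3;
  else if (cp <= 0x1fffff) n = 4;
  else if (cp <= 0x3ffffff) n = 5;
  else if (cp <= 0x7fffffff) n = 6;
  else return 0;

  // Fill the continuation bytes from the end, 6 payload bits at a time.
  for (size_t i = n - 1; i > 0; i--)
  {
    out[i] = (uint8_t)(0x80 | (cp & 0x3f));
    cp >>= 6;
  }

  // Lead byte: n high bits set, one zero bit, then the remaining payload.
  out[0] = (uint8_t)((0xff << (8 - n)) | cp);
  return n;
}
```

For code points up to 0x10ffff the output is byte-for-byte the same as standard UTF-8, which is what would keep such a scheme compatible with existing strings.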

PureFox48 commented 3 years ago

Yes, but isn't it going to seem odd to people if String.fromCodePoints isn't completely consistent with String.fromCodePoint?

The latter is after all just a degenerate case (for one code-point) of the former.

mhermier commented 3 years ago

The idea is to reuse things, but to extend them to allow all encodings. Compliance was not a goal, so why not go all the way?

PureFox48 commented 3 years ago

I'm not sure that I see much practical value in that as I don't know why anyone would be interested in, say, UTF-16 unless they were trying to embed Wren in a Java or C# application.

But, as long as the existing String.fromCodePoint method were extended in a similar way, I'd have no particular problem with it.

mhermier commented 3 years ago

fromBytes is done. It still needs polish, docs and tests.

PureFox48 commented 3 years ago

Great stuff!

Possible docs:

String.fromBytes(bytes)

Creates a new string consisting of each byte in a List of bytes.

System.print(String.fromBytes([87, 114, 101, 110])) //> Wren

It is a runtime error if bytes is not a List or if any byte is not an integer between 0 and 0xff, inclusive.

mhermier commented 3 years ago

fromCodePoints is done, but I have to dig into an obscure bug where the stack seems to be incorrect, and will do the polish later.

mhermier commented 3 years ago

Code is done (in a style that will probably be rejected, but it is made to allow error return values) and polished (unless I find a really odd corner case in testing, but it is so trivial that I have doubts). Conformance tests and documentation are still missing. Full encoding up to the theoretical maximum needs to be done in a separate patch.