Open PureFox48 opened 3 years ago
While I agree this is sometimes awkward, nothing prevents you from having a separate ByteList class with a fromString constructor and a toString method to achieve the same functionality.
Having it in the core means that the base language grows, which is usually not wanted.
String is implemented as a read-only structure because of internal trade-offs, mainly to implement a fast hash map.
toLower and toUpper do not exist because nobody has felt the need yet, but the request seems legitimate.
From a technical point of view, we can offer a compromise solution by allowing a List of Num to be converted to a String. It would go like this:
String {
  static fromBytes(bytes) { fromBytes_(bytes.toList) }
  foreign static fromBytes_(bytes)
}
fromBytes_ would check that the given argument is a List containing only integer numbers compatible with bytes, and construct a string from it.
As for toLower and toUpper, the problem may reside in the fact that we don't use a UTF-8 library to provide a valid conversion. At best we can do a cheap ASCII-7 version of it, which might not be enough for some people.
From a technical point of view, we can offer a compromise solution by allowing a List of Num to be converted to a String.
Yes, as long as the fromBytes_ private method is implemented in C, I'd be happy with that.
As for toLower and toUpper, the problem may reside in the fact that we don't use a UTF-8 library to provide a valid conversion.
I agree that's a difficulty. In my own implementations of these methods, I go a bit further than ASCII-7 by dealing with code points up to 255 which, apart from a few rare characters, is enough to deal with all the major Western European languages. Much improved coverage and still quite cheap to implement.
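As an illustration of the "code points up to 255" approach, here is a C sketch under hypothetical names. In Latin-1, the ASCII letters and the accented ranges each differ by 0x20 between cases; the few exceptional characters alluded to above (e.g. 0xDF 'ß', whose uppercase is "SS", and 0xFF 'ÿ', whose uppercase lies outside Latin-1) are simply left unchanged here:

```c
// Sketch: case conversion limited to code points 0..255 (Latin-1).
// The multiplication/division signs 0xD7 and 0xF7 sit inside the
// letter ranges and must be excluded.
static int latin1_to_lower(int cp) {
  if (cp >= 'A' && cp <= 'Z') return cp + 0x20;
  if (cp >= 0xC0 && cp <= 0xDE && cp != 0xD7) return cp + 0x20;
  return cp;  // digits, punctuation, and the rare exceptions pass through
}

static int latin1_to_upper(int cp) {
  if (cp >= 'a' && cp <= 'z') return cp - 0x20;
  if (cp >= 0xE0 && cp <= 0xFE && cp != 0xF7) return cp - 0x20;
  return cp;  // 0xDF 'ß' and 0xFF 'ÿ' deliberately left unchanged
}
```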
If we go that route we might also want to have a fromCodePoints method.
Yes, we'd ideally need both to maintain symmetry with the existing String.fromByte and String.fromCodePoint methods.
However, even without knowing much about the VM internals, I can see that the latter would be trickier to implement.
For the 'bytes' version, you'd already know the size of the string buffer needed from the size of the list and presumably you could just 'block copy' the byte values into that.
For the 'codePoints' version, you'd need to iterate through the list (whose elements could now go up to 0x10ffff rather than just 0xff) to figure out how many bytes would be needed to represent each element in UTF-8 and then allocate a string buffer big enough to accommodate the total. You'd then need to iterate through the list again and embed the bytes for each element, one by one, into the buffer. There may be ways to optimize this process but it's clearly going to be much slower than the 'bytes' version.
If the list contained any 'out of range' values, that would have to be a runtime error as it is now for the existing single byte/code-point conversion methods.
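The two-pass scheme described above can be sketched in C as follows. The names utf8_encoded_size and utf8_encode are illustrative, not the VM's actual helpers, and -1 stands in for raising the runtime error mentioned above:

```c
// Pass 1 helper: how many UTF-8 bytes a code point needs,
// or -1 if it is out of range (> 0x10ffff or negative).
static int utf8_encoded_size(int cp) {
  if (cp < 0) return -1;
  if (cp < 0x80) return 1;
  if (cp < 0x800) return 2;
  if (cp < 0x10000) return 3;
  if (cp <= 0x10ffff) return 4;
  return -1;
}

// Pass 2 helper: write the UTF-8 bytes for one code point into out,
// returning the number of bytes written (or -1 for out-of-range input).
static int utf8_encode(int cp, unsigned char* out) {
  int size = utf8_encoded_size(cp);
  switch (size) {
    case 1:
      out[0] = (unsigned char)cp;
      break;
    case 2:
      out[0] = 0xc0 | (cp >> 6);
      out[1] = 0x80 | (cp & 0x3f);
      break;
    case 3:
      out[0] = 0xe0 | (cp >> 12);
      out[1] = 0x80 | ((cp >> 6) & 0x3f);
      out[2] = 0x80 | (cp & 0x3f);
      break;
    case 4:
      out[0] = 0xf0 | (cp >> 18);
      out[1] = 0x80 | ((cp >> 12) & 0x3f);
      out[2] = 0x80 | ((cp >> 6) & 0x3f);
      out[3] = 0x80 | (cp & 0x3f);
      break;
  }
  return size;
}
```

Summing utf8_encoded_size over the list gives the buffer size for a single allocation; utf8_encode then fills it in the second pass.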
Incidentally, if we had a String.fromCodePoints method, then it might be quick enough to implement String.lower and String.upper methods in Wren itself.
What do you think about just going up to code-point 255 for these which wouldn't require a table or some external Unicode resource?
I'm thinking things may be a little more complicated than that, and maybe we should be only UTF-8 encoding compatible rather than compliant. That would remove a lot of headaches.
maybe we should be only UTF-8 encoding compatible rather than compliant
Can you clarify, as I'm not sure what you mean by that.
Do you perhaps mean that out of range values could be dealt with by masking rather than throwing runtime errors?
I mean allowing all the possible code points that the encoding allows: no banning of ranges reserved for code pages or for UTF-16, and supporting sequences up to 8 (or so) bytes long.
Yes, but isn't it going to seem odd to people if String.fromCodePoints isn't completely consistent with String.fromCodePoint?
The latter is after all just a degenerate case (for one code-point) of the former.
The idea is to reuse things, but to extend them to allow all encodings. Compliance was not a goal, so why not go all the way?
I'm not sure that I see much practical value in that as I don't know why anyone would be interested in, say, UTF-16 unless they were trying to embed Wren in a Java or C# application.
But, as long as the existing String.fromCodePoint method were extended in a similar way, I'd have no particular problem with it.
fromBytes is done. Needs polish, docs and tests.
Great stuff!
Possible docs:
String.fromBytes(bytes)

Creates a new string consisting of each byte in a List of bytes.

System.print(String.fromBytes([87, 114, 101, 110])) //> Wren

It is a runtime error if bytes is not a List or if any byte is not an integer between 0 and 0xff, inclusive.
fromCodePoint is done but I have to dig into some obscure bug where the stack seems not to be correct... and will do the polish later.
Code is done (in a style that will probably be rejected, but it is made to allow error return values) and polished (unless I find a really odd corner case in testing, but it is so trivial that I have doubts). Conformance tests and documentation are still missing. Full encoding of the maximum theoretical range needs to be done in a separate patch.
One thing I find rather awkward in Wren is manipulating strings as byte buffers. Although converting a string to a mutable byte list is easy enough, converting it back again after manipulation is not so nice as the following simple case conversion shows:
On the face of it, you're having to convert each byte to a string and then concatenate them all together, which is hardly efficient.
I'm not sure how strings are implemented in Wren - length-prefixed byte buffers? - but I'd have thought there must be a much more efficient way of doing this from the C side of things.
I realize of course that a String.fromBytes method would imply an ASCII encoding (and String.fromCodePoints a UTF-8 encoding) but I make no apologies for that as these are by far the commonest string encodings in current use.

Is there perhaps some technical difficulty here? I've always found it strange that Wren doesn't have String.upper and String.lower methods, which even languages with minimal standard libraries usually have. Although these are easy enough to write in Wren itself, they're inefficient and very slow for large strings.

I'd appreciate any comments.