Open PureFox48 opened 1 year ago
You can implement compareTo
in C for better performance and to handle handle BOM prelude.
But I would go optimize even furzer, and directly use strcmp
. Since regular UTF-8 encoding is compatible with strcmp
(ignoring some edge cases like BOM, characters with multiple encoding, sub-optimal encoding...), it should provide a pretty reliable and fast comparison.
Thinking at it again we might need to use memcmp
instead. Since String
s known their size and since \0
is a legal UTF-8 char, I think it is a better solution. That way, we can also compare strings holding binary data.
I hadn't really thought about it before, but given the way UTF-8 is encoded, I think you're right that strcmp/memcmp
will give the correct results when comparing two UTF-8 strings lexicographically.
I wouldn't have thought that we'd need to worry about the UTF-8 BOM, because if we've picked that up when reading a string from a file (say), then we should have already stripped it off from the Wren side before doing any comparisons.
Anyway, if we're going to implement compareTo
in C, then we'll need to add something like the following in the appropriate places in wren_core.c:
DEF_PRIMITIVE(string_compareTo)
{
if (!validateString(vm, args[1], "Argument")) return false;
ObjString* str1 = AS_STRING(args[0]);
ObjString* str2 = AS_STRING(args[1]);
uint32_t len1 = str1->length;
uint32_t len2 = str2->length;
// If strings have equal length, return memcmp result.
if (len1 == len2) RETURN_NUM(memcmp(str1->value, str2->value, len1));
// Get minimum length for comparison.
uint32_t len3 = (len1 <= len2) ? len1 : len2;
int res = memcmp(str1->value, str2->value, len3);
// If result is non-zero, just return that.
if (res) RETURN_NUM(res);
// Otherwise the shorter string will come first.
res = (len1 < len2) ? -1 : 1;
RETURN_NUM(res);
}
...
PRIMITIVE(vm->stringClass, "compareTo(_)", string_compareTo);
I would compact it a little bit more.
I would:
(len1==len2)
, It would need some benchmark on
real use cases to prove it is really useful.len3
to minLen
res == -1 || res==1
to res != 0
si memcmp
only guarantee
the sign and not the value.OK, I've incorporated your first two suggestions.
I'd already dealt with the third as I realized just after I'd posted it that we could only guarantee the sign.
DEF_PRIMITIVE(string_compareTo)
{
if (!validateString(vm, args[1], "Argument")) return false;
ObjString* str1 = AS_STRING(args[0]);
ObjString* str2 = AS_STRING(args[1]);
uint32_t len1 = str1->length;
uint32_t len2 = str2->length;
// Get minimum length for comparison.
uint32_t minLen = (len1 <= len2) ? len1 : len2;
int res = memcmp(str1->value, str2->value, minLen);
// If result is non-zero, just return that.
if (res) RETURN_NUM(res);
// If the lengths are the same, the strings must be equal
if (len1 == len2) RETURN_NUM(0);
// Otherwise the shorter string will come first.
res = (len1 < len2) ? -1 : 1;
RETURN_NUM(res);
}
...
PRIMITIVE(vm->stringClass, "compareTo(_)", string_compareTo);
We can use something more compact, by doing something like:
// Use a variation of https://graphics.stanford.edu/~seander/bithacks.html#CopyIntegerSign
#define intcmp(a, b) (((a) > (b)) - ((a) < (b)))
RETURN_NUM(res != 0 ? res : intcmp(len1, len2));
Not sure if it should be a macro or specialized function since we might want to have at least Num
to be comparable.
Speaking of which maybe we might want to use <=>
operator instead of compareTo
or any other name.
Well, we can do something like that if you want to keep the line count down but it's hardly making the code easier to read.
Personally, I'll be surprised if String.compareTo
is used directly very much anyway. I think people will just use the standard comparison operators instead which, of course, will call it under the hood.
Going off topic slightly, I think a better use for a new operator such as <=>
would be as a 'swap' function. So we'd have:
var a = 1
var b = 2
a <=> b
System.print([a, b]) // [2, 1]
Currently, we have nothing like this in the language for scalars and it's impossible to write one as we don't have parameter passing by reference. You either have to use a temporary variable (3 lines) or (for numbers) a one line hack such as::
var a = 1
var b = 2
b = a + b - (a = b)
System.print([a, b]) // [2, 1]
I noticed recently that the supercomputer language, Chapel, was using <=>
for this purpose so we'd be in good company :)
Python, Ruby and Lua use multiple assignment for swapping but I don't like that solution much as it introduces problems of its own.
About the ease of read, it is subjective. My solution at least express the intent in code, so I find it simpler than reading comments. But it is a minor detail, as long as the code working, it is fine.
For future proofing, I would use size_t
instead of uint32_t
, while overkill it should have only a limited impact (since it is stack/register memory and not allocated memory), and if we need to change string size to size_t
it will be one site less to update...
While I find the idea interesting for <=>
to be swap, I think it is too much limited in the case of the wren language. Since we only simulated left values (via operators (...)=(value)
and [...]=(value)
) it would only be applicable on bare variable names. So it seems to be too much restricted to avoid the 3 lines boiler plate required to manually implement swap, and could be taking advantage of our simulated left values (at a non negligible or avoidable cost).
OK, I'll change the type of the lengths to size_t
and leave the rest of it alone.
The way I envisaged a swap operator working is analogous to =
except, yes, you'd need a variable on both sides. You could still use it as expression but it wouldn't be overloadable.
But you're probably right that it wouldn't be worth the effort of implementing all this just to save two lines of code. Anyway it's just an idea I had :)
Just to finish up on the 'swap' thing, the nearest I can get to a generic swap function, using what we currently have, is this:
import "meta" for Meta
var Swap = Fn.new { |a, b|
var s = "{\n"
s = s + " var t = %(a)\n"
s = s + " %(a) = %(b)\n"
s = s + " %(b) = t\n"
s = s + "}\n"
Meta.compile(s).call()
}
var a = 1
var b = 2
Swap.call("a", "b")
System.print([a, b])
var c = 3
var d = 4
Swap.call("c", "d")
System.print([c, d])
var e = [5, 6]
Swap.call("e[0]", "e[1]")
System.print(e)
/* output:
[2, 1]
[4, 3]
[6, 5]
*/
The generated code is placed in its own block to ensure the temporary variable 't' is destroyed after each invocation.
Note though that you have to pass the variable names as strings. Although it works for indexed variables, we don't really need this as we have list.swap
.
Incidentally, it struck me later that the Swap()
function in my previous post would not be a great solution as it relies on capturing the swapped variables and will only work at top level if defined in the same module.
Nor will it work if called from within a function or method - not much does with Meta.compile
and friends :(
I've now opened a PR to implement string comparisons on the basis agreed above.
Currently, if you need to compare two strings or sort a list of strings lexicographically by codepoint, then you need to write your own comparison routine. This is an unsatisfactory state of affairs since you can't add your own comparison operators (or anything else) to the built-in
String
class or inherit from it.I therefore propose that we add a
compareTo
method and comparison operators to theString
class so that it will be as easy to compare two strings or sort a list of strings as it is to perform the same operations on numbers.As we need to work at the codepoint rather than the byte level, it will probably be quick enough to do this from the Wren side and here's my attempt at some code which is essentially what I've been using myself for the last 3 years or so, albeit in the guise of static methods of a proxy class:
I look forward to receiving comments on this proposal and any suggestions for improving the code.