wren-lang / wren

The Wren Programming Language. Wren is a small, fast, class-based concurrent scripting language.
http://wren.io
MIT License
6.93k stars 555 forks source link

[RFC] Add comparison operators to the String class #1142

Open PureFox48 opened 1 year ago

PureFox48 commented 1 year ago

Currently, if you need to compare two strings or sort a list of strings lexicographically by codepoint, then you need to write your own comparison routine. This is an unsatisfactory state of affairs since you can't add your own comparison operators (or anything else) to the built-in String class or inherit from it.

I therefore propose that we add a compareTo method and comparison operators to the String class so that it will be as easy to compare two strings or sort a list of strings as it is to perform the same operations on numbers.

As we need to work at the codepoint rather than the byte level, it will probably be quick enough to do this from the Wren side and here's my attempt at some code which is essentially what I've been using myself for the last 3 years or so, albeit in the guise of static methods of a proxy class:

class String is Sequence {
  /* existing stuff */

  // Compares this to another string lexicographically by codepoint.
  // Returns -1, 0 or +1 depending on whether
  // this < other, this == other or this > other respectively.
  compareTo(other) {
     if (!(other is String)) Fiber.abort("Other must be a string")
     if (this == other) return 0
     var cp1 = this.codePoints
     var cp2 = other.codePoints
     var len = (cp1.count <= cp2.count) ? cp1.count : cp2.count
     for (i in 0...len) {
        if (cp1[i] < cp2[i]) return -1
        if (cp1[i] > cp2[i]) return 1
     }
     return (cp1.count < cp2.count) ? -1 : 1
  }

  // Comparison operators.
  <  (other) { compareTo(other) <  0 }
  <= (other) { compareTo(other) <= 0 }
  >  (other) { compareTo(other) >  0 }
  >= (other) { compareTo(other) >= 0 } 
}

I look forward to receiving comments on this proposal and any suggestions for improving the code.

mhermier commented 1 year ago

You can implement compareTo in C for better performance and to handle handle BOM prelude.

But I would go optimize even furzer, and directly use strcmp. Since regular UTF-8 encoding is compatible with strcmp (ignoring some edge cases like BOM, characters with multiple encoding, sub-optimal encoding...), it should provide a pretty reliable and fast comparison.

mhermier commented 1 year ago

Thinking at it again we might need to use memcmp instead. Since Strings known their size and since \0 is a legal UTF-8 char, I think it is a better solution. That way, we can also compare strings holding binary data.

PureFox48 commented 1 year ago

I hadn't really thought about it before, but given the way UTF-8 is encoded, I think you're right that strcmp/memcmp will give the correct results when comparing two UTF-8 strings lexicographically.

I wouldn't have thought that we'd need to worry about the UTF-8 BOM, because if we've picked that up when reading a string from a file (say), then we should have already stripped it off from the Wren side before doing any comparisons.

Anyway, if we're going to implement compareTo in C, then we'll need to add something like the following in the appropriate places in wren_core.c:

DEF_PRIMITIVE(string_compareTo)
{
  if (!validateString(vm, args[1], "Argument")) return false;

  ObjString* str1 = AS_STRING(args[0]);
  ObjString* str2 = AS_STRING(args[1]);

  uint32_t len1 = str1->length;
  uint32_t len2 = str2->length;

  // If strings have equal length, return memcmp result.
  if (len1 == len2) RETURN_NUM(memcmp(str1->value, str2->value, len1));

  // Get minimum length for comparison.
  uint32_t len3 = (len1 <= len2) ? len1 : len2;
  int res = memcmp(str1->value, str2->value, len3);

  // If result is non-zero, just return that.
  if (res) RETURN_NUM(res);

  // Otherwise the shorter string will come first.
  res = (len1 < len2) ? -1 : 1;
  RETURN_NUM(res);
}

...

PRIMITIVE(vm->stringClass, "compareTo(_)", string_compareTo);
mhermier commented 1 year ago

I would compact it a little bit more.

I would:

PureFox48 commented 1 year ago

OK, I've incorporated your first two suggestions.

I'd already dealt with the third as I realized just after I'd posted it that we could only guarantee the sign.


DEF_PRIMITIVE(string_compareTo)
{
  if (!validateString(vm, args[1], "Argument")) return false;

  ObjString* str1 = AS_STRING(args[0]);
  ObjString* str2 = AS_STRING(args[1]);

  uint32_t len1 = str1->length;
  uint32_t len2 = str2->length;

  // Get minimum length for comparison.
  uint32_t minLen = (len1 <= len2) ? len1 : len2;
  int res = memcmp(str1->value, str2->value, minLen);

  // If result is non-zero, just return that.
  if (res) RETURN_NUM(res);

  // If the lengths are the same, the strings must be equal
  if (len1 == len2) RETURN_NUM(0);

  // Otherwise the shorter string will come first.
  res = (len1 < len2) ? -1 : 1;
  RETURN_NUM(res);
}

...

PRIMITIVE(vm->stringClass, "compareTo(_)", string_compareTo);
mhermier commented 1 year ago

We can use something more compact, by doing something like:

// Use a variation of https://graphics.stanford.edu/~seander/bithacks.html#CopyIntegerSign
#define intcmp(a, b) (((a) > (b)) - ((a) < (b)))

RETURN_NUM(res != 0 ? res : intcmp(len1, len2));

Not sure if it should be a macro or specialized function since we might want to have at least Num to be comparable.

Speaking of which maybe we might want to use <=> operator instead of compareTo or any other name.

PureFox48 commented 1 year ago

Well, we can do something like that if you want to keep the line count down but it's hardly making the code easier to read.

Personally, I'll be surprised if String.compareTo is used directly very much anyway. I think people will just use the standard comparison operators instead which, of course, will call it under the hood.

Going off topic slightly, I think a better use for a new operator such as <=> would be as a 'swap' function. So we'd have:

var a = 1
var b = 2
a <=> b
System.print([a, b]) // [2, 1]

Currently, we have nothing like this in the language for scalars and it's impossible to write one as we don't have parameter passing by reference. You either have to use a temporary variable (3 lines) or (for numbers) a one line hack such as::

var a = 1
var b = 2
b = a + b - (a = b)
System.print([a, b]) // [2, 1]

I noticed recently that the supercomputer language, Chapel, was using <=> for this purpose so we'd be in good company :)

Python, Ruby and Lua use multiple assignment for swapping but I don't like that solution much as it introduces problems of its own.

mhermier commented 1 year ago

About the ease of read, it is subjective. My solution at least express the intent in code, so I find it simpler than reading comments. But it is a minor detail, as long as the code working, it is fine. For future proofing, I would use size_t instead of uint32_t, while overkill it should have only a limited impact (since it is stack/register memory and not allocated memory), and if we need to change string size to size_t it will be one site less to update...

While I find the idea interesting for <=> to be swap, I think it is too much limited in the case of the wren language. Since we only simulated left values (via operators (...)=(value) and [...]=(value)) it would only be applicable on bare variable names. So it seems to be too much restricted to avoid the 3 lines boiler plate required to manually implement swap, and could be taking advantage of our simulated left values (at a non negligible or avoidable cost).

PureFox48 commented 1 year ago

OK, I'll change the type of the lengths to size_t and leave the rest of it alone.

The way I envisaged a swap operator working is analogous to = except, yes, you'd need a variable on both sides. You could still use it as expression but it wouldn't be overloadable.

But you're probably right that it wouldn't be worth the effort of implementing all this just to save two lines of code. Anyway it's just an idea I had :)

PureFox48 commented 1 year ago

Just to finish up on the 'swap' thing, the nearest I can get to a generic swap function, using what we currently have, is this:

import "meta" for Meta

var Swap = Fn.new { |a, b|
  var s = "{\n"
  s = s + "  var t = %(a)\n"
  s = s + "  %(a) = %(b)\n"
  s = s + "  %(b) = t\n"
  s = s + "}\n"
  Meta.compile(s).call()
}

var a = 1
var b = 2
Swap.call("a", "b")
System.print([a, b])

var c = 3
var d = 4
Swap.call("c", "d")
System.print([c, d])

var e = [5, 6]
Swap.call("e[0]", "e[1]")
System.print(e)

/* output:
[2, 1]
[4, 3]
[6, 5]
*/

The generated code is placed in its own block to ensure the temporary variable 't' is destroyed after each invocation.

Note though that you have to pass the variable names as strings. Although it works for indexed variables, we don't really need this as we have list.swap.

PureFox48 commented 1 year ago

Incidentally, it struck me later that the Swap() function in my previous post would not be a great solution as it relies on capturing the swapped variables and will only work at top level if defined in the same module.

Nor will it work if called from within a function or method - not much does with Meta.compile and friends :(

PureFox48 commented 1 year ago

I've now opened a PR to implement string comparisons on the basis agreed above.