List.sort() very slow if list is already sorted or nearly so.

PureFox48 commented 1 year ago

I noticed recently that List.sort(), which is based on the Quicksort algorithm, is very slow indeed when applied to lists which are already sorted or nearly so.

For example, this code takes around 5.7 seconds to run on my Core i7 machine:

var list = (1..10000).toList
list.sort()

I tracked down the problem to the choice of pivot in the private partition_ method in wren_core.wren:

partition_(low, high, comparer) {
    var p = this[high]
    // blah
}

If instead of always using the high value we use the mid value:

partition_(low, high, comparer) {
    var mid = ((low + high)/2).floor
    this.swap(mid, high)
    var p = this[high]
    // blah
}

then the code now runs in around 0.014 seconds. Moreover, the effect of adding these two lines has negligible effect on the time taken to sort unsorted lists.

The implementation also has a problem when the list contains repeated elements. For example this code takes around 2.8 seconds to run whatever the pivot choice:

var list = [1] * 10000
list.sort()

However, I can't think of a simple way to solve this particular problem which probably occurs much less often than the 'already sorted' one in practice. If anyone has an idea on this, please post it.

These, of course, are known issues with Quicksort generally which is also very slow when sorting small lists. The latter could be solved by using an insertion sort for lists under 10 elements say but, frankly, I doubt whether it's worth the effort in a simple scripting language such as Wren.

ruby0x1 commented 1 year ago

For the last snippet I was wondering. I'm assuming so but did you measure only the sort or both lines? If not what does this version measure at?

var list = List.filled(10000, 1)
list.sort()

ruby0x1 commented 1 year ago

Thanks for digging btw, we can probably add some tests around this kinda thing to give us a baseline to compare with and validate against.

PureFox48 commented 1 year ago

I just measured the sort.

The time to create the list is virtually the same at 0.0183 seconds whether one uses: List.filled or the List.* operator so possibly they're using the same code under the hood.

Incidentally, the Wikipedia article on Quicksort suggests that using a 'median of three' may be better than a simple midpoint. It was actually a shade slower in my own tests possibly because the cost of interpreting the extra code outweighed any efficiency gained.

PureFox48 commented 1 year ago

Oops, sorry, I messed up the timings for the list creation.

List.* takes 0.00092 seconds but List.filled only takes around 1 millionth of a second. However, either way, it's still insignificant compared to the sort time of 2.8 seconds.

mhermier commented 1 year ago

Did you tried with a more dumb algorithm, like bubble sort? That way it would use less stack memory and could compute a little less.

ruby0x1 commented 1 year ago

here's one I had in my libs laying around

    static bubble_sort(list, compare) {
      if(list.count == 0) return
      var done = false
      while(!done) {
        done = true
        for(i in 0 ... list.count-1) {
          var comp = compare.call(list[i], list[i+1])
          if(comp > 0) {
            var tmp = list[i]
            list[i] = list[i+1]
            list[i+1] = tmp
            done = false
          }
        }
      }
    } //bubble_sort

PureFox48 commented 1 year ago

Using the following optimized implementation of an ascending bubble sort:

var bubbleSort = Fn.new { |a|
    var n = a.count
    while (true) {
        var n2 = 0
        for (i in 1...n) {
            if (a[i - 1] > a[i]) {
                a.swap(i, i - 1)
                n2 = i
            }
        }
        n = n2
        if (n == 0) break
    }
}

the results are a bit variable but it's typically taking around 0.0008 seconds for both the 'already sorted' and 'all the same' cases.

@Ruby0x1's version is slightly slower than this.

As expected, this is much quicker than Quicksort even with the modified pivot but, of course, it would be a completely different story if the list contained random values.

ruby0x1 commented 1 year ago

one has a predicate so not a direct comparison (and not that it matters grand scheme, interesting none the less!)

PureFox48 commented 1 year ago

After reading through the salient parts of the Wikipedia article, I wonder whether the best solution here would be to change the partition scheme from Lomuto (which we seem to have at present) to Hoare?

On the face of it, this would give reasonable (albeit sub-optimal) performance for the 'already sorted' and 'all the same' cases and should be no slower for unsorted data. The interface of List.sort would be the same and so this change would be invisible to users.

The alternative would be to stick with what we have, changing the pivot choice to midpoint (or median of three) to give reasonable performance for already sorted lists and just accept the hit on cases where a lot of the data is the same on the grounds that this shouldn't occur very often in practice. We could perhaps put a warning in the docs about this so that knowledgeable users could use an alternative algorithm for this kind of scenario.

ruby0x1 commented 1 year ago

If the Hoare version works better in all cases @PureFox48, then it makes sense to consider it. A PR would be the ideal path then, if you're up for it.

PureFox48 commented 1 year ago

OK, when I have a clearer head, I'll try and do a Hoare version and see how it compares.

mhermier commented 1 year ago

Do you have numbers with random values with Buble sort? I mean, the stress on the stack is less heavy with it. And in our case it could be enough unless we have a very large amount of data.

mhermier commented 1 year ago

You optimized version can optimized a little bit more with: while (n != 0)and removing: if (n == 0) break.

PureFox48 commented 1 year ago

Both bubble sort and insertion sort are hopelessly slow when sorting random data even for lists as small as 100 elements.

Check out the Wren results for this Rosetta code task, comparing the performance of various sorting methods, which I did some time ago.

To keep things simple, I think we must stick with quicksort for the List.sort method which is easily the fastest algorithm for random data of non-trivial size and should (with Hoare partitioning which I'm working on just now) still give reasonable results even when the data is already sorted or all the same.

PureFox48 commented 1 year ago

OK, I've done the Hoare version and it seems to be working OK.

To make it easier to play about with them, I've put it (and also the modified Lomuto code) in separate classes for now.

import "random" for Random

// Lomuto partitioning (as per List.sort()) but with midpoint pivot.
class List2 {
  static sort(a) { sort(a) {|low, high| low < high } }

  static sort(a, comparer) {
    if (!(comparer is Fn)) {
      Fiber.abort("Comparer must be a function.")
    }
    quicksort_(a, 0, a.count - 1, comparer)
    return a
  }

  static quicksort_(a, low, high, comparer) {
    if (low < high) {
      var p = partition_(a, low, high, comparer)
      quicksort_(a, low, p - 1, comparer)
      quicksort_(a, p + 1, high, comparer)
    }
  }

  static partition_(a, low, high, comparer) {
    var mid = ((low + high)/2).floor // added this line
    a.swap(mid, high)                // ditto
    var p = a[high]
    var i = low - 1
    for (j in low..(high-1)) {
      if (comparer.call(a[j], p)) {
        i = i + 1
        var t = a[i]
        a[i] = a[j]
        a[j] = t
      }
    }
    var t = a[i+1]
    a[i+1] = a[high]
    a[high] = t
    return i+1
  }
}

// Hoare partitioning as per Wikpedia article.
class List3 {
  static sort(a) { sort(a) {|low, high| low < high } }

  static sort(a, comparer) {
    if (!(comparer is Fn)) {
      Fiber.abort("Comparer must be a function.")
    }
    quicksort_(a, 0, a.count - 1, comparer)
    return a
  }

  static quicksort_(a, low, high, comparer) {
    if (low >= 0 && high >= 0 && low < high) {
      var p = partition_(a, low, high, comparer)
      quicksort_(a, low, p, comparer)
      quicksort_(a, p+1, high, comparer)
    }
  }

  static partition_(a, low, high, comparer) {
    var mid = ((low + high)/2).floor
    var p = a[mid]
    var i = low - 1
    var j = high + 1
    while (true) {
      while (true) {
        i = i + 1
        if (!comparer.call(a[i], p)) break
      }
      while (true) {
        j = j - 1
        if (!comparer.call(p, a[j])) break
      }
      if (i >= j) return j
      a.swap(i, j)
    }
  }
}

var n = 20
var sorted = (1..n).toList
var same = List.filled(n, 1)
var rand = Random.new()
var unsorted = sorted.toList
rand.shuffle(unsorted)

System.print("Tests for List3.sort (Hoare):\n")

System.print(sorted)
List3.sort(sorted)
System.print(sorted)
System.print()

System.print(same)
List3.sort(same)
System.print(same)
System.print()

var unsorted2 = unsorted.toList // make a copy
System.print(unsorted)
List3.sort(unsorted) { |a, b| a < b }
System.print(unsorted)
System.print()

System.print(unsorted2)
List3.sort(unsorted2) { |a, b| a > b } // descending sort
System.print(unsorted2)

/* output:
Tests for List3.sort (Hoare):

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

[8, 12, 14, 19, 15, 1, 20, 11, 3, 6, 2, 16, 13, 18, 5, 10, 9, 4, 17, 7]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]

[8, 12, 14, 19, 15, 1, 20, 11, 3, 6, 2, 16, 13, 18, 5, 10, 9, 4, 17, 7]
[20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
*/

I'm doing some comparative benchmarks and will post those shortly.

PureFox48 commented 1 year ago

OK, here's the performance figures for the existing List.sort method (Lomuto) with the midpoint pivot:

Benchmarks for List2.sort (modified Lomuto):

Running 'Sorted data size 10' over 10 iteration(s):

Best 0.004 ms Mean 0.004 ms Worst 0.007 ms

Running 'Same data size 10' over 10 iteration(s):

Best 0.005 ms Mean 0.007 ms Worst 0.019 ms

Running 'Random data size 10' over 10 iteration(s):

Best 0.004 ms Mean 0.008 ms Worst 0.024 ms

Running 'Sorted data size 100' over 10 iteration(s):

Best 0.060 ms Mean 0.064 ms Worst 0.096 ms

Running 'Same data size 100' over 10 iteration(s):

Best 0.304 ms Mean 0.309 ms Worst 0.340 ms

Running 'Random data size 100' over 10 iteration(s):

Best 0.076 ms Mean 0.080 ms Worst 0.110 ms

Running 'Sorted data size 1000' over 10 iteration(s):

Best 0.840 ms Mean 0.850 ms Worst 0.908 ms

Running 'Same data size 1000' over 10 iteration(s):

Best 27.987 ms Mean 28.121 ms Worst 28.585 ms

Running 'Random data size 1000' over 10 iteration(s):

Best 1.144 ms Mean 1.159 ms Worst 1.208 ms

Running 'Sorted data size 10000' over 10 iteration(s):

Best 11.382 ms Mean 11.455 ms Worst 11.632 ms

Running 'Same data size 10000' over 10 iteration(s):

Best 2773.198 ms Mean 2777.167 ms Worst 2792.362 ms

Running 'Random data size 10000' over 10 iteration(s):

Best 16.246 ms Mean 16.354 ms Worst 16.724 ms

PureFox48 commented 1 year ago

...and here's the performance figures for the Hoare partitioning version:

Benchmarks for List3.sort (Hoare):

Running 'Sorted data size 10' over 10 iteration(s):

Best 0.005 ms Mean 0.005 ms Worst 0.007 ms

Running 'Same data size 10' over 10 iteration(s):

Best 0.005 ms Mean 0.006 ms Worst 0.007 ms

Running 'Random data size 10' over 10 iteration(s):

Best 0.005 ms Mean 0.006 ms Worst 0.007 ms

Running 'Sorted data size 100' over 10 iteration(s):

Best 0.065 ms Mean 0.067 ms Worst 0.068 ms

Running 'Same data size 100' over 10 iteration(s):

Best 0.080 ms Mean 0.081 ms Worst 0.083 ms

Running 'Random data size 100' over 10 iteration(s):

Best 0.084 ms Mean 0.085 ms Worst 0.086 ms

Running 'Sorted data size 1000' over 10 iteration(s):

Best 0.839 ms Mean 0.846 ms Worst 0.901 ms

Running 'Same data size 1000' over 10 iteration(s):

Best 1.080 ms Mean 1.100 ms Worst 1.167 ms

Running 'Random data size 1000' over 10 iteration(s):

Best 1.183 ms Mean 1.217 ms Worst 1.303 ms

Running 'Sorted data size 10000' over 10 iteration(s):

Best 10.046 ms Mean 10.114 ms Worst 10.231 ms

Running 'Same data size 10000' over 10 iteration(s):

Best 13.034 ms Mean 13.129 ms Worst 13.248 ms

Running 'Random data size 10000' over 10 iteration(s):

Best 14.275 ms Mean 14.372 ms Worst 14.620 m

PureFox48 commented 1 year ago

For good measure, here are some figures for lists containing 100,000 and 1,000,000 random elements.

Benchmarks for existing list.sort (Lomuto):

Running 'Random data size 100000' over 10 iteration(s):

Best 175.197 ms Mean 180.517 ms Worst 184.312 ms

Running 'Random data size 1000000' over 10 iteration(s):

Best 2273.706 ms Mean 2315.149 ms Worst 2338.914 ms

Benchmarks for List2.sort (modified Lomuto):

Running 'Random data size 100000' over 10 iteration(s):

Best 188.390 ms Mean 190.232 ms Worst 192.531 ms

Running 'Random data size 1000000' over 10 iteration(s):

Best 2380.935 ms Mean 2383.646 ms Worst 2392.719 ms

Benchmarks for List3.sort (Hoare):

Running 'Random data size 100000' over 10 iteration(s):

Best 174.246 ms Mean 174.686 ms Worst 175.152 ms

Running 'Random data size 1000000' over 10 iteration(s):

Best 2028.855 ms Mean 2031.921 ms Worst 2036.394 ms

This demonstrates just how fast the quicksort algorithm is, even for large amounts of data and using an interpreted language such as Wren.

PureFox48 commented 1 year ago

Whilst we're at it, I think we should also add ordering operators to the String class so that (amongst other things) we can easily sort strings lexicographically by codepoint. This code should do it:

class String is Sequence {
  /* existing stuff */

  // Compares this to another string lexicographically by codepoint.
  // Returns -1, 0 or +1 depending on whether
  // this < other, this == other or this > other respectively.
  compareTo(other) {
     if (!(other is String)) Fiber.abort("Other must be a string")
     if (this == other) return 0
     var cp1 = this.codePoints
     var cp2 = other.codePoints
     var len = (cp1.count <= cp2.count) ? cp1.count : cp2.count
     for (i in 0...len) {
        if (cp1[i] < cp2[i]) return -1
        if (cp1[i] > cp2[i]) return 1
     }
     return (cp1.count < cp2.count) ? -1 : 1
  }

  // Comparison operators.
  <  (other) { compareTo(other) <  0 }
  <= (other) { compareTo(other) <= 0 }
  >  (other) { compareTo(other) >  0 }
  >= (other) { compareTo(other) >= 0 } 
}

mhermier commented 1 year ago

Please open a separate discussion for this, it can be optimized more than that.

PureFox48 commented 1 year ago

Done. See #1142.

ruby0x1 commented 1 year ago

Thanks for the detailed posts @PureFox48, it's always appreciated ✨

mhermier commented 1 year ago

For reference, can you benchmark the algorithm provided by List, and the Hoare applied to it (It should optimize a little bit more, because functional is a little bit sub-optimal in wren).

PureFox48 commented 1 year ago

The only one we haven't benchmarked so far is the existing list.sort() method which is based on Lomuto partitioning but with the high value pivot. Consequently, it's very slow when the data is already sorted or is the same.The figures for that are given below:

Benchmarks for existing list.sort (Lomuto):

Running 'Sorted data size 10' over 10 iteration(s):

Best 0.007 ms Mean 0.008 ms Worst 0.012 ms

Running 'Same data size 10' over 10 iteration(s):

Best 0.005 ms Mean 0.005 ms Worst 0.006 ms

Running 'Random data size 10' over 10 iteration(s):

Best 0.003 ms Mean 0.004 ms Worst 0.004 ms

Running 'Sorted data size 100' over 10 iteration(s):

Best 0.578 ms Mean 0.583 ms Worst 0.595 ms

Running 'Same data size 100' over 10 iteration(s):

Best 0.296 ms Mean 0.300 ms Worst 0.304 ms

Running 'Random data size 100' over 10 iteration(s):

Best 0.069 ms Mean 0.070 ms Worst 0.070 ms

Running 'Sorted data size 1000' over 10 iteration(s):

Best 53.978 ms Mean 55.671 ms Worst 57.360 ms

Running 'Same data size 1000' over 10 iteration(s):

Best 27.065 ms Mean 27.468 ms Worst 28.218 ms

Running 'Random data size 1000' over 10 iteration(s):

Best 1.151 ms Mean 1.161 ms Worst 1.213 ms

Running 'Sorted data size 10000' over 10 iteration(s):

Best 5471.917 ms Mean 5604.434 ms Worst 5644.828 ms

Running 'Same data size 10000' over 10 iteration(s):

Best 2754.123 ms Mean 2757.615 ms Worst 2764.081 ms

Running 'Random data size 10000' over 10 iteration(s):

Best 14.600 ms Mean 14.767 ms Worst 15.056 ms

List2.sort differs from the above in that it uses a mid-value pivot to remedy the 'already sorted' problem.

List3.sort uses Hoare partitoning to remedy both the 'already sorted' and 'are the same' problems.

To ensure all 3 methods use exactly the same data, I've updated the figures for the other two in my earlier posts.

My overall conclusion is that we should change to the Hoare method which gives reasonable results across the board and seems a shade faster than Lomuto for large lists.

PureFox48 commented 1 year ago

As it seems clear from these figures that a change to Hoare partitioning will be beneficial, I've now opened a PR for it.

mhermier commented 1 year ago

By having the preconditions of count > 1 are the conditions low >= 0 && high >= 0 not always true and could be removed?

PureFox48 commented 1 year ago

The only time when high is less than 0 is when the list is empty and it's then equal to -1. However, lowis then 0 and the condition low < high then fails and the method just returns without doing anything.

So, yes, we can get rid of the low >= 0 && high >= 0 part and I'll change the code in the PR when I have time.

wren-lang / wren