[Proposal] Add List.distinct

PureFox48 commented 3 years ago

A useful method which many languages have is a distinct or unique method which removes duplicates from a list or array and I'd like to propose that we add something similar to Wren.

The most obvious way to implement this would be something like the following which makes use of the new List.indexOf mathod:

distinct {
    var result = []
    for (element in this) {
        if (result.indexOf(element) == -1) result.add(element)
    }
    return result
}

This has the advantage that the original ordering is maintained, doesn't require much code but does require a lot of computation.

I have seen this kind of method implemented either by first sorting the list or by using a map to weed out the duplicates. However, a major problem with these approaches is that the original ordering is lost and, for sorting, you need to provide a comparer function. Using a map is quick but only certain types of object are suitable as map keys so you'd probably have toString everything else.

So two questions. Do you agree that a distinct method would be worth adding and, if so, can you think of a more efficient way to implement it?

mhermier commented 3 years ago

Problem is not about agreeing (even if I don't have the final decision), but more about: does it have a broad usage to justify inclusion in core.

You can use contains instead of adding indexOf. An overload might be required to allow user to provide an equality function.

PureFox48 commented 3 years ago

does it have a broad usage to justify inclusion in core.

All I can say is that it's come up a lot in my own code using various languages over the years.

You can use contains instead of adding indexOf.

It appears that @ruby0x1 has already added List.indexOf (rightly IMHO) as it's mentioned in the 0.4.0 Changelog. So we might as well use it.

An overload might be required to allow user to provide an equality function.

Possibly, though default equality for the type in question should normally work fine.

mhermier commented 3 years ago

Well using indexOf instead of contains is the difference between making it a Sequence method and a List method.

PureFox48 commented 3 years ago

That's a good point.

Making it a Sequence method rather than a List method would be better as it's then more general.

ChayimFriedman2 commented 3 years ago

Sequences can be infinite.

Many languages requires the list to be sorted already. This has both the advantages: using faster algorithm and not requiring to provide a comparer, along with being faster if the list is already sorted.

Edit: Languages that require to be sorted:

Rust: Vec::dedup().
C++: std::unique().

mhermier commented 3 years ago

True, but you might want to sort you list according to another criteria that the default one.

ChayimFriedman2 commented 3 years ago

If you leave the sorting task up to the user, he can sort however he wants.

PureFox48 commented 3 years ago

I can think of 4 other languages where the list doesn't have to be sorted: C#, Java, Kotlin and Ruby.

To my mind, the unsorted case is definitely the most useful though and, as @mhermier said earlier, you might not want to use the default sort order so sorting is probably best left as a separate step.

jube commented 3 years ago

In C++, std::unique does not requires the input to be sorted. Like all the other functions mentionned, it removes consecutive duplicates except the first one. Like uniq in shell scripts. This function can be implemented in O(n).

The function in the first comment does not do this, it removes duplicates that may not be consecutive, which is very misleading for all people that know this function in other languages. Moreover, it's implemented in O(n^2).

For @ChayimFriedman2, the documentation of the Rust function does not require the input to be sorted, it only says that if the input is sorted, all duplicates are removed.

PureFox48 commented 3 years ago

The distinct functions in C#, Java and Kotlin remove all duplicates (not just consecutive duplicates) as can be readily seen from their documentation:

Even if I'd never used these functions before, this is what I'd expect them to do because distinct means that the resulting elements are not the same.

Ruby's uniq function also removes all duplicates.

You're correct about what C++'s unique and Rust's dedup functions do but, as they're using different names (and the former is confusingly named IMO in any case) I don't see why this would mislead people if Wren were to use Sequence.distinct.

ChayimFriedman2 commented 3 years ago

In C++, std::unique does not requires the input to be sorted. Like all the other functions mentionned, it removes consecutive duplicates except the first one. Like uniq in shell scripts. This function can be implemented in O(n).

The function in the first comment does not do this, it removes duplicates that may not be consecutive, which is very misleading for all people that know this function in other languages. Moreover, it's implemented in O(n^2).

For @ChayimFriedman2, the documentation of the Rust function does not require the input to be sorted, it only says that if the input is sorted, all duplicates are removed.

I know that, but like @PureFox48 said, because of their naming it seems that their primary purpose is to remove all duplicates - which is done only when the input is sorted. The "remove consecutive duplicates" is just an attempt to standardize what they do when the input is not sorted.

jube commented 3 years ago

My point is that a function like the one proposed in this issue is totally under-optimised for most cases (if not all cases). As I said before, it runs in O(n^2) as implemented above. But do you really need to keep the order of the elements? In many cases no, so the best strategy would be to first sort the sequence (in O(n log n)) and then use a function like unique in O(n). Total: O(n log n). Much better than O(n^2). And in some cases, you already know that your input is sorted and you can call unique directly. Total: O(n). So, I would like to know what the use case for keeping the remaining elements orderered is (you are removing some similar elements that were not in this position, so you do not really care about the position of the remaining elements), and hence, what the point of distinct as proposed is.

MasFlam commented 3 years ago

What could be done is have the Sequence method implement the less efficient algorithm, and since we have List.sort(comp) already, the distinct method on lists could use that, or remember if the list is already sorted and chose which algorithm to use.

mhermier commented 3 years ago

It can be under optimised, but it preserve the order of the elements and this can matter. Imagine I want a list of unique random int, having it sorted defeat the interest of having a random sequence.

If things needs to be fast, one can use a Bag/Set/(Unique)SortedList feeded from the list.

mhermier commented 3 years ago

Maybe then we need a distinct and distinctSort then, and that one could be optimised, by any means?

MasFlam commented 3 years ago

I kind of agree you might not want to have your list/sequence sorted automatically, but then it might still make sense to maybe keep track of whether a list is sorted to get a better complexity in case it is. Then if the user doesn't mind they can do .sort().distinct. Otherwise the action is O(n^2), but with a tweak it can be made just a little bit faster: in each iteration only check if result.contains(element) instead of this.contains(element). This doesn't change the asymptotic complexity, although makes it iterate through a smaller list each iteration and still yields a correct result.

Edit: Checking if the previous element is different than the current one is also a nice heuristic.

PureFox48 commented 3 years ago

Perhaps the best solution here would be for distinct to take a parameter - alreadySorted say.

The implementation could then use an appropriate algorithm for the case in point.

We could also have a second method (a simple getter) which called the primary method with alreadySorted set to false.

Of course, it would be up to the user to decide whether his/her sequence/list was already sorted or not which I think is fair enough.

mhermier commented 3 years ago

I don't like that idea, it is too easy for the user to produce false result/assumptions. And in addition, it is harder to implement optimized versions of the behavior.

That make me thing we can need isSorted and 'isSorted(comparison_fn)' functions.

PureFox48 commented 3 years ago

Sure, the user can get it wrong but the same could be said about binary search functions which assume the sequence is already sorted and you get strange results if they're not.

Although I wouldn't object to having an isSorted function, I think it would be too expensive for distinct to have to call this every time to decide which algorithm to use.

Incidentally, what is clear as far as sorting is concerned is that we badly need lexicographical ordering for strings and for them to support the ordering operators. The new List.sort method is going to be hamstrung without this.

ChayimFriedman2 commented 3 years ago

I think the solution should be made up of three parts:

Add List.distinct.
Add List.removeConsecutiveDuplicates (probably with a better name).
Add sets (that was mentioned previously).

PureFox48 commented 3 years ago

I don't think that adding sets would necessarily help here. The most natural implementation for Wren would be to use a map where, say, values are always set to true. However, you'd then have the problem that only value types can be used as keys and iteration order would be all over the place.

mhermier commented 3 years ago

This go way beyond the topic of this issue, and add way to much code to core for my taste, even ignoring Set. I think we should concentrate on doing a bare minimum functional guarantee in core even if it means having O(n^2) complexity. So it would just be adding:

class Sequence {
  ...

  distinct { ... }
  distinct(equalFn) { ... }
}

Making distinct lazy could be an interesting (required?) implementation choice for the default implementation, since it can deal with infinite Sequence. Memoization could be a plus, but I wonder how beneficial/expensive it could be. Once that is established, sub-classes optimizations can be thought.

I opened #900 to deal with with Sequence::isSorted because it should be an interesting addition, but should not be tightly coupled with this change because of infinite Sequence.

PureFox48 commented 3 years ago

I think we should concentrate on doing a bare minimum functional guarantee in core even if it means having O(n^2) complexity.

I'd be comfortable with that as I think we've become too side-tracked with the sorting issue which, to my mind, is best left as a separate operation when needed.

If it's going to be a Sequence method, then I think the default implementation should definitely be lazy though I'm not so sure about memoization. Do we use that for any of the existing methods?

mhermier commented 3 years ago

No currently it is not used anywhere. It can be added later with a dedicated lazy class if required I think.

wren-lang / wren

[Proposal] Add List.distinct #899