PureFox48 opened 3 years ago
The problem is not about agreeing (even if I don't have the final decision), but rather: does it have broad enough usage to justify inclusion in core? You can use `contains` instead of adding `indexOf`. An overload might be required to allow the user to provide an equality function.
> does it have a broad usage to justify inclusion in core.
All I can say is that it's come up a lot in my own code using various languages over the years.
> You can use `contains` instead of adding `indexOf`.
It appears that @ruby0x1 has already added `List.indexOf` (rightly, IMHO), as it's mentioned in the 0.4.0 changelog. So we might as well use it.
> An overload might be required to allow user to provide an equality function.
Possibly, though default equality for the type in question should normally work fine.
Well, using `indexOf` instead of `contains` is the difference between making it a `List` method and a `Sequence` method.
That's a good point.
Making it a `Sequence` method rather than a `List` method would be better, as it's then more general. `Sequence`s can be infinite.
Many languages require the list to be sorted already. This has two advantages: a faster algorithm can be used and no comparer needs to be provided. It's also faster if the list happens to be sorted already.

Edit: languages that require the input to be sorted: Rust's `Vec::dedup()`, C++'s `std::unique()`.

True, but you might want to sort your list according to a different criterion than the default one. If you leave the sorting task up to the user, they can sort however they want.
I can think of 4 other languages where the list doesn't have to be sorted: C#, Java, Kotlin and Ruby.
To my mind, the unsorted case is definitely the more useful one and, as @mhermier said earlier, you might not want to use the default sort order, so sorting is probably best left as a separate step.
In C++, `std::unique` does not require the input to be sorted. Like all the other functions mentioned, it removes consecutive duplicates except the first one, like `uniq` in shell scripts. This function can be implemented in O(n).

The function in the first comment does not do this: it removes duplicates that may not be consecutive, which is very misleading for people who know this function from other languages. Moreover, it's implemented in O(n^2).

For @ChayimFriedman2: the documentation of the Rust function does not require the input to be sorted; it only says that if the input is sorted, all duplicates are removed.
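For reference, the consecutive-duplicates behaviour described above (what `std::unique`, `Vec::dedup` and shell `uniq` do) can be sketched in a few lines. This is an illustrative Python sketch, not Wren, and the function name is mine:

```python
def dedup_consecutive(items):
    """Remove consecutive duplicates, keeping the first element of each
    run, like C++'s std::unique or shell uniq.  Runs in O(n)."""
    result = []
    for item in items:
        # Compare only against the most recently kept element.
        if not result or result[-1] != item:
            result.append(item)
    return result
```

Note that on a sorted input this removes all duplicates, which is the behaviour the thread keeps coming back to.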
The `distinct` functions in C#, Java and Kotlin remove all duplicates (not just consecutive ones), as can readily be seen from their documentation.

Even if I'd never used these functions before, this is what I'd expect them to do, because `distinct` means that the resulting elements are not the same. Ruby's `uniq` function also removes all duplicates.
You're correct about what C++'s `unique` and Rust's `dedup` functions do but, as they use different names (and the former is confusingly named, IMO, in any case), I don't see why this would mislead people if Wren were to use `Sequence.distinct`.
> In C++, `std::unique` does not require the input to be sorted. Like all the other functions mentioned, it removes consecutive duplicates except the first one, like `uniq` in shell scripts. This function can be implemented in O(n). The function in the first comment does not do this: it removes duplicates that may not be consecutive, which is very misleading for people who know this function from other languages. Moreover, it's implemented in O(n^2). For @ChayimFriedman2, the documentation of the Rust function does not require the input to be sorted; it only says that if the input is sorted, all duplicates are removed.
I know that, but like @PureFox48 said, because of their naming it seems that their primary purpose is to remove all duplicates - which is done only when the input is sorted. The "remove consecutive duplicates" is just an attempt to standardize what they do when the input is not sorted.
My point is that a function like the one proposed in this issue is totally under-optimised for most cases (if not all). As I said before, it runs in O(n^2) as implemented above. But do you really need to keep the order of the elements? In many cases no, so the best strategy would be to first sort the sequence (in O(n log n)) and then use a function like `unique` in O(n), for a total of O(n log n). Much better than O(n^2). And in some cases you already know that your input is sorted and can call `unique` directly, for a total of O(n). So I would like to know what the use case for keeping the remaining elements ordered is (you are removing some equal elements that were not in this position, so you don't really care about the positions of the remaining ones), and hence what the point of `distinct` as proposed is.
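The sort-then-unique strategy described above can be sketched as follows (an illustrative Python sketch, not Wren; the function name is mine):

```python
def distinct_unordered(items):
    """Sort first (O(n log n)), then drop consecutive duplicates (O(n)).
    Overall O(n log n); the original element order is not preserved."""
    result = []
    for item in sorted(items):
        # After sorting, equal elements are adjacent, so a single pass
        # comparing against the last kept element removes all duplicates.
        if not result or result[-1] != item:
            result.append(item)
    return result
```

This is the trade-off under discussion: a better asymptotic bound at the cost of losing the input order.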
What could be done is to have the `Sequence` method implement the less efficient algorithm and, since we already have `List.sort(comp)`, the `distinct` method on lists could use that, or remember whether the list is already sorted and choose which algorithm to use.
It can be under-optimised, but it preserves the order of the elements, and that can matter. Imagine I want a list of unique random ints: having it sorted defeats the point of having a random sequence.

If things need to be fast, one can use a `Bag`/`Set`/`(Unique)SortedList` fed from the list.
Maybe we then need both a `distinct` and a `distinctSort`, and the latter could be optimised, by any means?
I kind of agree that you might not want your list/sequence sorted automatically, but it might still make sense to keep track of whether a list is sorted, to get a better complexity when it is. Then, if the user doesn't mind, they can do `.sort().distinct`. Otherwise the operation is O(n^2), but with a tweak it can be made a little faster: in each iteration, check only `result.contains(element)` instead of `this.contains(element)`. This doesn't change the asymptotic complexity, but it iterates over a smaller list each time and still yields a correct result.

Edit: checking whether the previous element differs from the current one is also a nice heuristic.
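Both tweaks above (checking membership in the result rather than the input, and skipping runs of equal elements) can be combined like this; an illustrative Python sketch, not Wren, with names of my choosing:

```python
def distinct_ordered(items):
    """Order-preserving de-duplication, O(n^2) worst case.
    Membership is checked against the (smaller) result list rather than
    the whole input, and a run of equal elements is skipped via `prev`."""
    result = []
    prev = object()  # sentinel that compares unequal to everything
    for item in items:
        # If item equals the previous element, it has already been
        # either kept or rejected, so no scan of `result` is needed.
        if item != prev and item not in result:
            result.append(item)
        prev = item
    return result
```

The asymptotic complexity is unchanged, but the inner scan is over at most the number of distinct elements seen so far.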
Perhaps the best solution here would be for `distinct` to take a parameter - `alreadySorted`, say. The implementation could then use an appropriate algorithm for the case in point. We could also have a second method (a simple getter) which calls the primary method with `alreadySorted` set to false. Of course, it would be up to the user to decide whether their sequence/list was already sorted or not, which I think is fair enough.
I don't like that idea: it's too easy for the user to produce false results/assumptions. In addition, it's harder to implement optimised versions of the behaviour. That makes me think we may need `isSorted` and `isSorted(comparison_fn)` functions.
Sure, the user can get it wrong, but the same could be said about binary search functions, which assume the sequence is already sorted and give strange results if it isn't. Although I wouldn't object to having an `isSorted` function, I think it would be too expensive for `distinct` to call it every time to decide which algorithm to use.
Incidentally, what is clear as far as sorting is concerned is that we badly need lexicographical ordering for strings and for them to support the ordering operators. The new List.sort method is going to be hamstrung without this.
I think the solution should be made up of three parts:

- `List.distinct`.
- `List.removeConsecutiveDuplicates` (probably with a better name).

I don't think that adding sets would necessarily help here. The most natural implementation for Wren would be to use a map where, say, the values are always set to `true`. However, you'd then have the problem that only value types can be used as keys, and iteration order would be all over the place.
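The map-based idea just described can be sketched in Python, with the caveats carrying over differently: Python dicts happen to preserve insertion order (Wren maps do not), and the hashability requirement plays the role of Wren's value-types-only key restriction. This is an illustration, not a proposed Wren implementation:

```python
def distinct_via_map(items):
    """Weed out duplicates with a map whose values are always True.
    Only hashable elements can serve as keys, which is the analogue of
    Wren's restriction that only value types can be map keys."""
    seen = {}
    for item in items:
        seen[item] = True
    return list(seen)  # the keys, in insertion order (Python 3.7+)
```

In a language whose maps don't preserve insertion order, this approach loses the original ordering, which is exactly the objection raised above.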
This goes way beyond the topic of this issue and adds way too much code to core for my taste, even ignoring `Set`. I think we should concentrate on providing a bare minimum functional guarantee in core, even if it means having O(n^2) complexity. So it would just be adding:
```wren
class Sequence {
  ...
  distinct { ... }
  distinct(equalFn) { ... }
}
```
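The behaviour of that `distinct`/`distinct(equalFn)` pair might look something like this; a Python sketch of the semantics only, with the default predicate being plain equality:

```python
def distinct(items, equal_fn=None):
    """Bare-minimum O(n^2) de-duplication with an optional user-supplied
    equality predicate, mirroring the distinct/distinct(equalFn) pair."""
    if equal_fn is None:
        equal_fn = lambda a, b: a == b  # default equality for the type
    result = []
    for item in items:
        # Keep the item only if no already-kept item compares equal to it.
        if not any(equal_fn(item, kept) for kept in result):
            result.append(item)
    return result
```

The overload matters for types whose default equality isn't the notion of "duplicate" the caller wants, e.g. case-insensitive strings.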
Making `distinct` lazy could be an interesting (required?) implementation choice for the default implementation, since it can then deal with infinite `Sequence`s. Memoization could be a plus, but I wonder how beneficial/expensive it would be. Once that is established, subclass optimisations can be considered.
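A lazy `distinct` of the kind suggested can be illustrated with a generator; this Python sketch (names mine) shows why laziness copes with an infinite source:

```python
import itertools

def distinct_lazy(items):
    """Lazy de-duplication: yields each previously unseen element as soon
    as it arrives, so it can be driven by an unbounded iterator.  Memory
    grows only with the number of distinct elements seen so far."""
    seen = []
    for item in items:
        if item not in seen:
            seen.append(item)
            yield item

# Pulls three distinct values out of an infinite repeating source;
# an eager implementation would never terminate here.
first_three = list(itertools.islice(distinct_lazy(itertools.cycle([1, 2, 1, 3])), 3))
```

The per-element cost is still a linear scan of `seen`, so this is the same O(n^2) worst case as the eager version, just demand-driven.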
I opened #900 to deal with `Sequence::isSorted`, because it should be an interesting addition but should not be tightly coupled with this change, because of infinite `Sequence`s.
> I think we should concentrate on doing a bare minimum functional guarantee in core even if it means having O(n^2) complexity.
I'd be comfortable with that as I think we've become too side-tracked with the sorting issue which, to my mind, is best left as a separate operation when needed.
If it's going to be a Sequence method, then I think the default implementation should definitely be lazy though I'm not so sure about memoization. Do we use that for any of the existing methods?
No, currently it is not used anywhere. It could be added later with a dedicated lazy class if required, I think.
A useful method which many languages have is a `distinct` or `unique` method which removes duplicates from a list or array, and I'd like to propose that we add something similar to Wren.

The most obvious way to implement this would be something like the following, which makes use of the new `List.indexOf` method. This has the advantage that the original ordering is maintained, and it doesn't require much code, but it does require a lot of computation.
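The original Wren snippet is not reproduced in this extract. As an illustration only, the `indexOf`-based approach described (keep an element only when the current position is its first occurrence) might look like this in Python:

```python
def distinct_index_of(items):
    """Keep an element only if this position is its first occurrence,
    i.e. items.index(item) == i.  Order-preserving, O(n^2) overall,
    since index() performs a linear scan for every element."""
    return [item for i, item in enumerate(items) if items.index(item) == i]
```

Python's `list.index` plays the role of Wren's `List.indexOf` here; the function name is hypothetical.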
I have seen this kind of method implemented either by first sorting the list or by using a map to weed out the duplicates. However, a major problem with these approaches is that the original ordering is lost and, for sorting, you need to provide a `comparer` function. Using a map is quick, but only certain types of object are suitable as map keys, so you'd probably have to `toString` everything else.

So, two questions. Do you agree that a `distinct` method would be worth adding and, if so, can you think of a more efficient way to implement it?