Closed maxrothman closed 6 months ago
That's not a bad idea. I'd change the default initializer, though, and replace get
with find
in order to account for nil
values.
(defn group-with
([kf cf coll]
(group-with kf cf (cf) coll))
([kf cf init coll]
(reduce (fn [m v]
(let [k (kf v)]
(assoc m k (if-some [kv (find m k)]
(cf init (val kv))
init))))
{}
coll)))
The reason is that aggregation functions typically have a initial value that can be determined by calling the function with zero arguments:
(conj) => []
(+) => 0
(*) => 1
I also wonder, should the kf
be before or after cf
?
(group-with first conj [] coll)
;; or
(group-with conj [] first coll)
I guess following transduce
it would make sense for init
to be third, which would mean cf
would be second, but I'd be interested in hearing your opinion.
replace get with find in order to account for nil values
Agreed, that's an improvement.
change the default initializer...aggregation functions typically have a initial value that can be determined by calling the function with zero arguments
While I see the elegance in that approach, I think it would limit the utility of the function. Several of the examples in the top post would no longer work (or would require more complicated collation functions to work), such as the following:
;; Pick winner based on max-key
(group-with first (partial max-key count) data)
;; Pick first found rather than last found like index-by
(group-with first (fn [a _] a) data)
It's true that in most functions with multiple arities, the shorter arities act as shorthands for the longest one, but the proposed function more follows the model of reduce
, where the arities have slightly different functionality. One way to think about it is that the arity-4 form performs aggregation while the arity-3 form performs selection.
Though the arity-2 form of reduce
is often maligned, I think in this case it makes sense to follow that model. The reason reduce/2
is "bad" is because apply
can do everything it does better because all of the interesting binary-ish functions take varargs. With group-with/3
, the collation function is a little like a comparator: it takes 2 things of the same type and picks one. It's more like a classical binary reduction (reduce op [x1 x2 x3 ...] -> x1 op x2 op x3 op ...
)
I also wonder, should the kf be before or after cf?
I haven't thought about it deeply, but to me it seems natural for cf
to be next to init
, and for init
to be next-to-last, which implies the order kf cf init coll
. Which of these feels more natural to you?
first
, collating by conj
with initial value []
conj
, index by first
, for collation use the initial value []
Even disregarding the awkwardness of having to say collate
twice, I feel like the first matches my mental information hierarchy better.
Finally, I'm interested in your thoughts on what to call this function. group-with
? index-with
? collate-by
?
I've given it some thought, and I believe you're right regarding the 3-arity form. The parallels between the 3-arity group-with
and the 2-arity reduce
make sense.
Regarding what to call it, group-with
makes some sense. group-by-with
would perhaps be more accurate but doesn't read well. We could also call it collate
or collate-by
- the latter you suggested as well. My inclination is to call it collate-by
in order to fit with group-by
and index-by
, while giving the impression of a more generate purpose function.
I agree collate-by
is an accurate description, and correctly puts it in a category with index-by
and group-by
. My only hesitation is, do most potential users know what "collate" means, and do they associate it with the problem they would use the function to solve? Like, if what I knew I was looking for was "a more general group-by" and I ctrl+f'd medley's docs, that wouldn't necessarily turn up collate-by
. But is this a real issue? ¯\_(ツ)_/¯
So what's the next step, should I put a PR together? I confess I don't have a ton of experience with optimizing functions like this, but I'm open to learning.
If we add "group-by" to the docstring, it would turn up the right function if you Ctrl-F'd the docs.
Feel free to put together a PR. Take a quick look over the CONTRIBUTING.md document if you can - it's not very long.
In terms of optimising, reduce
is an efficient iterator, so the only efficiency gain I can see beyond that is to use a transient map:
(defn- collate-by* [kf cf initf coll]
(persistent!
(reduce (fn [m v]
(let [k (kf v)]
(assoc! m k (if-let [kv (find m k)]
(cf (val kv) v)
(initf v)))))
(transient {})
coll)))
(defn collate-by
([kf cf coll]
(collate-by* kf cf identity coll))
([kf cf init coll]
(collate-by* kf cf #(cf init %) coll)))
However, now that I write that out, I'm wondering if having an initf
function instead of an init
value would be a more general purpose function. It doesn't follow the pattern of reduce
, but unlike reduce
we can't easily get the first element of each group. That is, we know that (reduce f coll)
is the same as (reduce f (first coll) (rest coll))
, but we can't easily do this for collate-by
.
Transducing functions do have a 1 argument form, i.e. (conj x) => x
, but in a transducer this is for being called on the final result, rather than the initial value, so the purpose is different.
So would this be better?
(defn collate-by
([keyf collatef coll]
(collate-by keyf collatef identity coll))
([keyf collatef initf coll]
(persistent!
(reduce (fn [m v]
(let [k (keyf v)]
(assoc! m k ((if-let [kv (find m k)]
(collatef (val kv) v)
(initf v)))))
(transient {})
coll))))
The advantage of this design is that the 3-arity functionality is a subset of the 4-arity functionality, and it potentially opens up more use-cases (though I'm unsure what they would be).
The disadvantage is that you'd need to write:
(collate-by kf cf #(cf init %) coll)
But most collections have functions for this already, e.g.
(collate-by kf conj vector coll) ;; group in vectors
(collate-by kf conj hash-set coll) ;; group in sets
(collate-by kf conj list coll) ;; group in lists
If we add "group-by" to the docstring, it would turn up the right function if you Ctrl-F'd the docs.
Good point, we should do the same with index-by
too.
However, now that I write that out, I'm wondering if having an initf function instead of an init value would be a more general purpose function...
To me it seems like the tradeoff here is between familiarity/clarity and generality. On one hand, people already know how reduce
works, and []
is a little easier on the eyes than vector
. Also, not having to use cf
in two places makes it easier to use anonymous functions. On the other, an init function is more general.
My instinct says that though the initf
approach is more general, if we can't come up with a single usecase for it then the demand for it would be pretty rare, which tips the balance towards familiarity/clarity in my view. The only thing I could come up with was that if there was some collection from a library or something that had no 0-arity constructor, then it would be useful. But that seems pretty exotic, and in every other case, using the 0-arity constructor would suffice.
I'll give it some thought - there's no need for a hasty decision - but I'm currently leaning toward initf
.
In either case we'll need a function that uses initf
, as that's the common functionality between the two arities. The only question is whether or not we make that function public, or wrap it in another function that replaces initf
with init
.
My inclination is to avoid wrappers that make the API more familiar, at the cost of distancing ourselves from the underlying implementation. This seems a case where initf
would be "simple" and init
would be "easy", and in general we should favor the former over the latter.
I was reminded of this issue today, and figured i'd check in! Shall we get the ball rolling again?
Having thought more about the initf
vs init
question, I realized one material advantage of initf
over init
: it supports collating into mutable types (e.g. ArrayList, JS arrays, etc.). I think the use case would be rare, but when you needed it, it would be nice to have it there.
On the other hand, the major inconvenience of having to write out #(cf init %)
I think would come not from writing it out per se, but from having to bind cf
to a name so you could reference it in the cf
and initf
args. I do think that inconvenience is significant though, since it forces you to bind an otherwise small anonymous fn to a name and add another level of nesting to the code.
So I think the important question is: how often would it come up? If a type has a constructor that takes at least one argument, then you can always use the constructor rather than writing out an anonymous fn. All Clojure types have constructors like that (even numbers via *
and +
, yay monoids), and the only one I've seen in the wild that doesn't is this interval set library, which provides empty
and mark
fns but no constructor for non-empty sets. There very well might be more, but they're likely to also be somewhat exotic library-provided types.
So this leaves us with a tradeoff between an uncommon use case and another uncommon use case, but at least now I think the tradeoffs are concrete.
I think there are also two possible compromises in addition to the previously discussed approaches:
init
with fn?
and wrap it in a fn if it isn't already in one. This could possibly create a moment of confusion if someone tried to aggregate fns (e.g. with comp
) but this seems so exotic as to not be significant. It's slightly magic, but maybe not so bad?:init []
or :initf vector
. The newish unified call syntax makes this really nice to do, though users of pre-1.11 Clojure wouldn't have that nicety. I think there's also a midway version of this approach: have a 3-arity that accepts an :initf
kwarg and a 4-arity with a static init
arg. That would preserve the convenience of a regular old init
arg for the vast majority of usecases, and let users break out the initf
kwarg in the small number of cases where it was actually needed.Thoughts?
Thanks for the update, and apologies for the delay in getting back to you. After giving it some thought, my preference leans toward initf
implementation. This may be less convenient under certain circumstances, but under most use-cases there should be no difference, and it's the simpler implementation - insofar that the init
version would essentially be a wrapper over the initf
version. I hadn't considered that it would work better with mutable collections, but you're right, and that does seem like another minor point in favour of initf
.
Sounds good, do you want me to submit a PR or would you prefer to do it? I think the code is pretty much already in this issue.
I should be able to do it. The hard part is not the code, now, but the tests :)
Not infrequently I find myself wanting to to group a collection with a custom collation. Between medley and clojure.core we already have two functions with hard-coded collations:
group-by
usesconj
on a vector, andindex-by
uses last-wins collation. Both functions could be implemented in terms of a more general one that also lets devs provide their own collations:Here are some usage examples that compare implementations with and without this function: