Suggested addition: generalization of group-by/index-by

weavejester / medley

A lightweight library of useful Clojure functions

Eclipse Public License 1.0

870 stars 67 forks source link

Suggested addition: generalization of group-by/index-by #90

Closed maxrothman closed 6 months ago

maxrothman commented 8 months ago

Not infrequently I find myself wanting to to group a collection with a custom collation. Between medley and clojure.core we already have two functions with hard-coded collations: group-by uses conj on a vector, and index-by uses last-wins collation. Both functions could be implemented in terms of a more general one that also lets devs provide their own collations:

;; Placeholder name, this could even be additional arities of index-by
;; Naive implementation, not optimized
(defn group-with
  "key fn, collating fn, initial value, coll"
  ([kf cf coll]
   (reduce #(let [k (kf %2)]
              (assoc %1 k (if-some [v (get %1 k)] 
                            (cf v %2)
                            %2)))
           {} coll))
  ([kf cf init coll]
   (reduce #(update %1 (kf %2) (fnil cf init) %2) {} coll)))

Here are some usage examples that compare implementations with and without this function:

(def data
  ["apple" "banana" "clojure" "aardvark"])

;; Implementing group-by in terms of group-with
(group-by first data)
(group-with first conj [] data)

;; What if you want to use a different collection type?
(reduce #(update %1 (first %2) (fnil conj #{}) %2) {} data)
(-> (group-by first data) (update-vals set))  ;more clear, but inefficient

(group-with first conj #{} data)

;; First, defining a helper. This might also be a useful addition to medley, I'll file a separate issue if you're interested
(defn unapply
 "The opposite of apply: (unapply f [a b] c d) -> (f a b [c d])"
 [f & args]
  (if (and (vector? (first args)) (seq (rest args)))
    ((apply partial f (first args)) (rest args))
    (f args)))

;; Implementing index-by in terms of group-with
(into {} (map (juxt first identity)) data)
(-> (group-by first data) (update-vals last))  ;more clear, but inefficient
(reduce #(assoc %1 (first %2) %2) {} data)
(m/index-by first data)  ;obviously index-by is the most concise

(group-with first (fn [_ b] b) data)
(group-with first (partial unapply second) data)  ;unapply helps this read a little more clearly

;; What if you want first-wins collation instead of index-by's last-wins collation?
(reduce #(if (get %1 (first %2)) %1 (assoc %1 (first %2) %2)) {} data)  ;trickier because you have to deal with the initial nil value case
(reduce #(update %1 (first %2) (fn [a b] (if (some? a) a b)) %2) {} data)
(into {} (reverse (map (juxt [first identity]) data)))
(-> (group-by first data) (update-vals first))  ;more clear, but inefficient

(group-with first (fn [a _] a) data)  ;much better symmetry with the last-wins case
(group-with first (partial unapply first) data)

;; What if you want to pick the winner based on some other criteria? E.g. max-key
(reduce #(assoc %1 (first %2) (max-key count %2 (get %1 (first %2)))) {} data)
(reduce #(update %1 (first %2) (partial max-key count) %2) {} data)
(-> (group-by first data) (update-vals (partial apply max-key count)))  ;more clear, but inefficient

(group-with first (partial max-key count) data)

;; Similar to with first-wins collation, min-key ends up being tricky because (= (count nil) 0), whereas group-with retains a nice symmetry with the max-key version

weavejester commented 8 months ago

That's not a bad idea. I'd change the default initializer, though, and replace get with find in order to account for nil values.

(defn group-with
  ([kf cf coll]
   (group-with kf cf (cf) coll))
  ([kf cf init coll]
   (reduce (fn [m v]
             (let [k (kf v)]
               (assoc m k (if-some [kv (find m k)]
                            (cf init (val kv))
                            init))))
           {}
           coll)))

The reason is that aggregation functions typically have a initial value that can be determined by calling the function with zero arguments:

(conj) => []
(+) => 0
(*) => 1

I also wonder, should the kf be before or after cf?

(group-with first conj [] coll)
;; or
(group-with conj [] first coll)

I guess following transduce it would make sense for init to be third, which would mean cf would be second, but I'd be interested in hearing your opinion.

maxrothman commented 8 months ago

replace get with find in order to account for nil values

Agreed, that's an improvement.

change the default initializer...aggregation functions typically have a initial value that can be determined by calling the function with zero arguments

While I see the elegance in that approach, I think it would limit the utility of the function. Several of the examples in the top post would no longer work (or would require more complicated collation functions to work), such as the following:

;; Pick winner based on max-key
(group-with first (partial max-key count) data)

;; Pick first found rather than last found like index-by
(group-with first (fn [a _] a) data)

It's true that in most functions with multiple arities, the shorter arities act as shorthands for the longest one, but the proposed function more follows the model of reduce, where the arities have slightly different functionality. One way to think about it is that the arity-4 form performs aggregation while the arity-3 form performs selection.

Though the arity-2 form of reduce is often maligned, I think in this case it makes sense to follow that model. The reason reduce/2 is "bad" is because apply can do everything it does better because all of the interesting binary-ish functions take varargs. With group-with/3, the collation function is a little like a comparator: it takes 2 things of the same type and picks one. It's more like a classical binary reduction (reduce op [x1 x2 x3 ...] -> x1 op x2 op x3 op ...)

I also wonder, should the kf be before or after cf?

I haven't thought about it deeply, but to me it seems natural for cf to be next to init, and for init to be next-to-last, which implies the order kf cf init coll. Which of these feels more natural to you?

Index by first, collating by conj with initial value []
Collate using conj, index by first, for collation use the initial value []

Even disregarding the awkwardness of having to say collate twice, I feel like the first matches my mental information hierarchy better.

Finally, I'm interested in your thoughts on what to call this function. group-with? index-with? collate-by?

weavejester commented 8 months ago

I've given it some thought, and I believe you're right regarding the 3-arity form. The parallels between the 3-arity group-with and the 2-arity reduce make sense.

Regarding what to call it, group-with makes some sense. group-by-with would perhaps be more accurate but doesn't read well. We could also call it collate or collate-by - the latter you suggested as well. My inclination is to call it collate-by in order to fit with group-by and index-by, while giving the impression of a more generate purpose function.

maxrothman commented 8 months ago

I agree collate-by is an accurate description, and correctly puts it in a category with index-by and group-by. My only hesitation is, do most potential users know what "collate" means, and do they associate it with the problem they would use the function to solve? Like, if what I knew I was looking for was "a more general group-by" and I ctrl+f'd medley's docs, that wouldn't necessarily turn up collate-by. But is this a real issue? ¯\_(ツ)_/¯

So what's the next step, should I put a PR together? I confess I don't have a ton of experience with optimizing functions like this, but I'm open to learning.

weavejester commented 8 months ago

If we add "group-by" to the docstring, it would turn up the right function if you Ctrl-F'd the docs.

Feel free to put together a PR. Take a quick look over the CONTRIBUTING.md document if you can - it's not very long.

In terms of optimising, reduce is an efficient iterator, so the only efficiency gain I can see beyond that is to use a transient map:

(defn- collate-by* [kf cf initf coll]
  (persistent!
   (reduce (fn [m v]
             (let [k (kf v)]
               (assoc! m k (if-let [kv (find m k)] 
                             (cf (val kv) v)
                             (initf v)))))
           (transient {}) 
           coll)))

(defn collate-by
  ([kf cf coll]
   (collate-by* kf cf identity coll))
  ([kf cf init coll]
   (collate-by* kf cf #(cf init %) coll)))

However, now that I write that out, I'm wondering if having an initf function instead of an init value would be a more general purpose function. It doesn't follow the pattern of reduce, but unlike reduce we can't easily get the first element of each group. That is, we know that (reduce f coll) is the same as (reduce f (first coll) (rest coll)), but we can't easily do this for collate-by.

Transducing functions do have a 1 argument form, i.e. (conj x) => x , but in a transducer this is for being called on the final result, rather than the initial value, so the purpose is different.

So would this be better?

(defn collate-by
  ([keyf collatef coll]
    (collate-by keyf collatef identity coll))
  ([keyf collatef initf coll]
   (persistent!
    (reduce (fn [m v]
              (let [k (keyf v)]
                (assoc! m k ((if-let [kv (find m k)] 
                              (collatef (val kv) v)
                              (initf v)))))
            (transient {}) 
            coll))))

The advantage of this design is that the 3-arity functionality is a subset of the 4-arity functionality, and it potentially opens up more use-cases (though I'm unsure what they would be).

The disadvantage is that you'd need to write:

(collate-by kf cf #(cf init %) coll)

But most collections have functions for this already, e.g.

(collate-by kf conj vector coll)    ;; group in vectors
(collate-by kf conj hash-set coll)  ;; group in sets
(collate-by kf conj list coll)      ;; group in lists

maxrothman commented 8 months ago

If we add "group-by" to the docstring, it would turn up the right function if you Ctrl-F'd the docs.

Good point, we should do the same with index-by too.

However, now that I write that out, I'm wondering if having an initf function instead of an init value would be a more general purpose function...

To me it seems like the tradeoff here is between familiarity/clarity and generality. On one hand, people already know how reduce works, and [] is a little easier on the eyes than vector. Also, not having to use cf in two places makes it easier to use anonymous functions. On the other, an init function is more general.

My instinct says that though the initf approach is more general, if we can't come up with a single usecase for it then the demand for it would be pretty rare, which tips the balance towards familiarity/clarity in my view. The only thing I could come up with was that if there was some collection from a library or something that had no 0-arity constructor, then it would be useful. But that seems pretty exotic, and in every other case, using the 0-arity constructor would suffice.

weavejester commented 8 months ago

I'll give it some thought - there's no need for a hasty decision - but I'm currently leaning toward initf.

In either case we'll need a function that uses initf, as that's the common functionality between the two arities. The only question is whether or not we make that function public, or wrap it in another function that replaces initf with init.

My inclination is to avoid wrappers that make the API more familiar, at the cost of distancing ourselves from the underlying implementation. This seems a case where initf would be "simple" and init would be "easy", and in general we should favor the former over the latter.

maxrothman commented 7 months ago

I was reminded of this issue today, and figured i'd check in! Shall we get the ball rolling again?

maxrothman commented 7 months ago

Having thought more about the initf vs init question, I realized one material advantage of initf over init: it supports collating into mutable types (e.g. ArrayList, JS arrays, etc.). I think the use case would be rare, but when you needed it, it would be nice to have it there.

On the other hand, the major inconvenience of having to write out #(cf init %) I think would come not from writing it out per se, but from having to bind cf to a name so you could reference it in the cf and initf args. I do think that inconvenience is significant though, since it forces you to bind an otherwise small anonymous fn to a name and add another level of nesting to the code.

So I think the important question is: how often would it come up? If a type has a constructor that takes at least one argument, then you can always use the constructor rather than writing out an anonymous fn. All Clojure types have constructors like that (even numbers via * and +, yay monoids), and the only one I've seen in the wild that doesn't is this interval set library, which provides empty and mark fns but no constructor for non-empty sets. There very well might be more, but they're likely to also be somewhat exotic library-provided types.

So this leaves us with a tradeoff between an uncommon use case and another uncommon use case, but at least now I think the tradeoffs are concrete.

I think there are also two possible compromises in addition to the previously discussed approaches:

Check the value of init with fn? and wrap it in a fn if it isn't already in one. This could possibly create a moment of confusion if someone tried to aggregate fns (e.g. with comp) but this seems so exotic as to not be significant. It's slightly magic, but maybe not so bad?
Use kwargs, i.e. :init [] or :initf vector. The newish unified call syntax makes this really nice to do, though users of pre-1.11 Clojure wouldn't have that nicety. I think there's also a midway version of this approach: have a 3-arity that accepts an :initf kwarg and a 4-arity with a static init arg. That would preserve the convenience of a regular old init arg for the vast majority of usecases, and let users break out the initf kwarg in the small number of cases where it was actually needed.

Thoughts?

weavejester commented 6 months ago

Thanks for the update, and apologies for the delay in getting back to you. After giving it some thought, my preference leans toward initf implementation. This may be less convenient under certain circumstances, but under most use-cases there should be no difference, and it's the simpler implementation - insofar that the init version would essentially be a wrapper over the initf version. I hadn't considered that it would work better with mutable collections, but you're right, and that does seem like another minor point in favour of initf.

maxrothman commented 6 months ago

Sounds good, do you want me to submit a PR or would you prefer to do it? I think the code is pretty much already in this issue.

weavejester commented 6 months ago

I should be able to do it. The hard part is not the code, now, but the tests :)