weavejester / medley

A lightweight library of useful Clojure functions
Eclipse Public License 1.0
865 stars 66 forks source link

Add partition-between #74

Closed transducer closed 1 year ago

transducer commented 1 year ago

In, for example, this Ask Clojure question and this Stackoverflow question there's asked for the partitioning of a sequence based on comparing consecutive values. partition-when implements this.

The implementation is based on Clojure's partition-by and Medley's partition-after and partition-before.

Note that also the non-transducer two-arity implementation is stateful to keep it performant and lazy.

weavejester commented 1 year ago

I'm not sure partition-when is the best name for a function like this. To me, -when suggests a single argument predicate. I think I'd also like to see some more practical use-cases on what this function could be used for. I'm open to the idea of this function in theory, though...

From a code perspective, the seq version of the function shouldn't use a volatile. The previous argument can be passed via an argument to the recursive function used to create the tail of the lazy seq.

transducer commented 1 year ago

I agree that -when implies a single argument. In the macro definition of when the argument is an expression called test that is evaluated, but in Clojure's halt-when it's passed a unary pred. split-with also takes a unarypred, but pred is used as a binary predicate in comparator ((pred x y)) and condp ((pred test-expr expr)). A suggestion for an analogous name then would be partition-with. I'll use partition-with below. And maybe partition-before-with and partition-after-with is even more approriate to support the various splitting options, consistent with partition-before and partition-after.

For use cases here are some examples, from the Ask Clojure question:

I'm trying to identify where a sensor's data collection was interrupted (e.g. power outage) by examining the difference between the timestamps of the data [so that [1 2 3 7 8 14 15 16 17 20 21 22] becomes [[1 2 3] [7 8] [14 15 16 17] [20 21 22]]]

and similar the Stackoverflow question use case is:

to partition an array of sorted integers into contiguous partitions [so that [1 2 3 8 9 10] becomes [[1 2 3], [8 9 10], [99]]]

in the above two examples partition-with would work like:

(partition-with (fn [x y] (> y (inc x))) [1 2 3 7 8 14 15 16 17 20 21 22])
;; => ((1 2 3) (7 8) (14 15 16 17) (20 21 22))

(partition-with (fn [x y] (> y (inc x))) [1 2 3 8 9 10 99])
;; => ((1 2 3) (8 9 10) (99))

There's an equivalent Ruby function called slice_when, that provide use cases in their documentation. I translated them to Clojure's equivalents below.

Chunking a one-by-one increasing subsequence:

(->> [1 2 4 9 10 11 12 15 16 19 20 21]
     (partition-with (fn [x y] (not= (inc x) y)))
     (mapcat (fn [c] (if (< (count c) 3) c [(format "%s-%s" (first c) (last c))])))
     (string/join ","))
;; => "1,2,4,9-12,15,16,19-21"

Near elements (threshold: 6) in sorted array can be chunked as follows:

(partition-with (fn [x y] (< 6 (- y x))) [3 11 14 25 28 29 29 41 55 57])
;; => ((3) (11 14) (25 28 29 29) (41) (55 57))

Increasing (non-decreasing) subsequence can be chunked as follows:

(partition-with > [0 9 2 2 3 2 7 5 9 5])
;; => ((0 9) (2 2 3) (2 7) (5 9) (5))

Adjacent evens and odds can be chunked as follows:

(partition-with (fn [x y] (not= (even? x) (even? y))) [7 5 9 2 0 7 9 4 2 0])
;; => ((7 5 9) (2 0) (7 9) (4 2 0))

/edit This can be more succinctly expressed using partition-by, namely: (partition-by even? [7 5 9 2 0 7 9 4 2 0]) ;; => ((7 5 9) (2 0) (7 9) (4 2 0)).

Paragraphs (non-empty lines with trailing empty lines) can be chunked as follows:

(->> ["foo\n" "bar\n" "\n" "baz\n" "qux\n"]
     (partition-with (fn [l1 l2] (and (re-matches #"\n" l1) (re-matches #".*\n" l2)))))
;; => (("foo\n" "bar\n" "\n") ("baz\n" "qux\n"))

(this one is different to Ruby's example, since it checks a trailing empty line followed by non-empty lines, we would need a partition-after-with for this.)

Also in Clojurians Slack functions like partition-when or partition-with are mentioned a few times.

Note that cgrand, the creator of the xform library, replies on the suggestion of partition-with:

The thing with partition-when is that it deals with delimiters and you then have a lot of choices: should delimiters be returned? (and if yes how? As single items? Sequence? Appended to the previous partition? Prepended to the next partition?) How delimiters runs are treated?
So either a single function with an options map or many functions. Or you find a way to compose it out of other fns.

In the current implementation it splits between where the predicate is true, but before or after would also be a reasonable choice (?) and these two variants exist already in medley (and around would maybe be an addition). For consistency this would imply functions like partition-before-with and partition-after-with (and possibly partition-around and partition-around-with).

Regarding your last remark:

The previous argument can be passed via an argument to the recursive function used to create the tail of the lazy seq.

In the current implementation it's not the previous argument to the recursive function that creates the tail of the lazy-seq, it's the previous argument passed to take-while inside the function to create the tail. I'll first give this a shot. :)

transducer commented 1 year ago

As described in earlier comment I investigated the various types of partitioning (before and after). I find it difficult to make an implementation for something like partition-after-with, because you need to access the element after the current one to know if you should make a cut on the first element (e.g., (sequence (partition-after-with <) [1 2 3 4 5]) should return ''([1] [2] [3] [4] [5]) and not '([1 2] [3] [4] [5]), but I find this impossible to implement because for the first element you need to know that the element that follows the current will partition or not, and you don't have access to it. Maybe I'm missing something, but because of this difficulty, I think only having a default -before version makes sense.

I tried to make an implementation based on passing the previous item as an argument, but I don't see how this can be accomplished in a performant manner. If you have a suggestion on how to implement this let me know. I am curious to see.

If you think this is al not viable this PR can be closed.

weavejester commented 1 year ago

I think partition-after-with is likely too specific anyway.

With regard to the seq version, I was thinking something like this:

(defn partition-with [pred coll]
  (letfn [(take-part [prev coll]
            (lazy-seq
             (when-let [[x & xs] (seq coll)]
               (when-not (pred prev x)
                 (cons x (take-part x xs))))))]
    (lazy-seq
     (when-let [[x & xs] (seq coll)]
       (let [run (take-part x xs)]
         (cons (cons x run) (partition-with pred (drop (count run) xs))))))))

(I haven't thoroughly tested this, though.)

transducer commented 1 year ago

Really clever, thanks. I added it with one change: I wrapped (drop (count run)) in a lazy-seq.

The reason is the same as in partition-by, namely CLJ-1764 ("partition-by counts and thus realizes each current partition to call itself recursively"). Equivalent example:

(def x (partition-with (fn [x y] (zero? x)) (range)))
(first x)
(first (second x)) ;; <- goes in infinite loop without the `lazy-seq`.
;; => 1

The two other lazy-seqs you added around the body of take-part and after the let make also the partition lazy, which is an improvement over the previous implementation:

(def res
  (m/partition-with (fn [x y] (println [x y]) (> y 100)) (range 200)))

(nth (first res) 5) ; => 0 (only compares up to [4 5])

Besides that it's also slightly faster when measured with (criterium.core/bench (vec (partition-with < (range 1e3)))).

weavejester commented 1 year ago

As a quick note, please ensure the commit message is plaintext, not markdown.

transducer commented 1 year ago

Thanks. I processed your review comments. Code and names were based on dedupe and partition-by, but I agree some parts were redundant and that your suggestions are more clear. To not shadow prev I called the dereffed version p. Let me know if you prefer the earlier pval or have another suggestion. I decided against prev-val or prior because the line then goes over 80 characters. Or maybe call p prev and give the volatile a bigger name?

Regarding adding or not, take your time to mull it over :) I think it's a nice and useful addition. On the other hand in clojure.core I have not seen any other functions that take a pred over a pair of adjacent items. I think only dedupe and partition-by compare previous values, but they use a fixed predicate (respectively equals and not-equals). Then again, why not? And partition-before, partition-after and partition-with do nicely line up with Ruby's slice_before, slice_after and slice_when.

NB I tried to change the implementation of the seq version, analogous to dedupe and dedupe-by, to ([pred coll] (sequence (partition-with pred) coll)). Criterium shows it has a roughly 20% better performance, but the laziness is lost. The first partition is always realized and, for example, (ffirst (m/partition-with > (range))) goes in an infinite loop. So this was not a better implementation. Probably same reason as to why partition-by is not implemented like that either ((first (my-partition-by zero? (range))) ; infinite loop).

tomdl89 commented 1 year ago

In case this is helpful for deciding on a good name, the (almost) same function in flatland.useful.seq is called partition-between https://cljdoc.org/d/org.flatland/useful/0.11.6/api/flatland.useful.seq#partition-between

weavejester commented 1 year ago

That is very helpful, thanks! I like partition-between, as "between" implies two consecutive values, and fits in more naturally with partition-before and partition-after. It is a little longer than partition-with, but I think that's a reasonable compromise.

transducer commented 1 year ago

Thanks, I have changed the name to partition-between. I needed to add a newline at two places to prevent lines going over 80 characters and shortened two test inputs from [1 2 3 4 5] to [1 2 3 4] for the same reason.

tomdl89 commented 1 year ago

This function does look like a superset of the functionality of partition-before and partition-after. Meaning we could redefine them in terms of it, performance allowing:

(defn partition-before
  ([pred] (partition-between (fn [_ x] (pred x))))
  ([pred coll] (partition-between (fn [_ x] (pred x)) coll)))

(defn partition-after
  ([pred] (partition-between (fn [x _] (pred x))))
  ([pred coll] (partition-between (fn [x _] (pred x)) coll)))

/edit It actually seems more performant for partition-before:

medley.core> (let [test-coll (shuffle (range 500))]
               (->> (bench/benchmark-round-robin [(dorun (partition-before odd? test-coll))
                                                  (dorun (partition-before-new odd? test-coll))]
                                                 bench/*default-benchmark-opts*)
                    (map :mean)))
([9.553932864338904E-5 (9.521377361241475E-5 9.609587795037858E-5)]
 [6.526032066012073E-5 (6.506766188578595E-5 6.551461647432628E-5)])

But a little worse (maybe not significant?) for partition-after:

medley.core> (let [test-coll (shuffle (range 500))]
               (->> (bench/benchmark-round-robin [(dorun (partition-after odd? test-coll))
                                                  (dorun (partition-after-new odd? test-coll))]
                                                 bench/*default-benchmark-opts*)
                    (map :mean)))
([5.4655768579906435E-5 (5.448247826989242E-5 5.486496964666048E-5)]
 [6.317321974920002E-5 (6.294439896625525E-5 6.342228236815859E-5)])
weavejester commented 1 year ago

This function does look like a superset of the functionality of partition-before and partition-after. Meaning we could redefine them in terms of it, performance allowing

Yes, I noticed that as well, and I was kicking myself for not thinking of this function when partition-before and partition-after were submitted. I may have reconsidered their inclusion if partition-between came first... though perhaps partition-before and partition-after are useful enough to warrant inclusion anyway.

I'm not sure if there are any potential edge conditions that may make them behave differently - I haven't found any, but I haven't extensively looked, either.

weavejester commented 1 year ago

With regard to the benchmarks, that's very interesting. How does the transducer version compare? My guess is that the transducer versions for the current implementation should slightly outperform the transducer version of an implementation based on partition-between.

tomdl89 commented 1 year ago

Correct:

medley.core> (let [test-coll (shuffle (range 500))]
               (->> (bench/benchmark-round-robin [(dorun (into [] (partition-before odd?) test-coll))
                                                  (dorun (into [] (partition-before-new odd?) test-coll))]
                                                 bench/*default-benchmark-opts*)
                    (map :mean)))
([2.2860320464171494E-5 (2.263882904997551E-5 2.3114705628187397E-5)]
 [2.653058937697948E-5 (2.6303527633302964E-5 2.6779686969065904E-5)])
medley.core> (let [test-coll (shuffle (range 500))]
               (->> (bench/benchmark-round-robin [(dorun (into [] (partition-after odd?) test-coll))
                                                  (dorun (into [] (partition-after-new odd?) test-coll))]
                                                 bench/*default-benchmark-opts*)
                    (map :mean)))
([2.5865390321092607E-5 (2.5597964750243003E-5 2.61791602106295E-5)]
 [2.8569139186355313E-5 (2.8261014440476192E-5 2.890015513141026E-5)])
transducer commented 1 year ago

Thanks for merge @weavejester and thanks for insightful comments both you and @tomdl89. Happy to have been able to suggest this contribution and learn from the process.