Closed transducer closed 1 year ago
I'm not sure partition-when
is the best name for a function like this. To me, -when
suggests a single argument predicate. I think I'd also like to see some more practical use-cases on what this function could be used for. I'm open to the idea of this function in theory, though...
From a code perspective, the seq version of the function shouldn't use a volatile. The previous argument can be passed via an argument to the recursive function used to create the tail of the lazy seq.
I agree that -when
implies a single argument. In the macro definition of when
the argument is an expression called test
that is evaluated, but in Clojure's halt-when
it's passed a unary pred
. split-with
also takes a unarypred
, but pred
is used as a binary predicate in comparator
((pred x y)
) and condp
((pred test-expr expr)
). A suggestion for an analogous name then would be partition-with
. I'll use partition-with
below. And maybe partition-before-with
and partition-after-with
is even more approriate to support the various splitting options, consistent with partition-before
and partition-after
.
For use cases here are some examples, from the Ask Clojure question:
I'm trying to identify where a sensor's data collection was interrupted (e.g. power outage) by examining the difference between the timestamps of the data [so that
[1 2 3 7 8 14 15 16 17 20 21 22]
becomes[[1 2 3] [7 8] [14 15 16 17] [20 21 22]]
]
and similar the Stackoverflow question use case is:
to partition an array of sorted integers into contiguous partitions [so that
[1 2 3 8 9 10]
becomes[[1 2 3], [8 9 10], [99]]
]
in the above two examples partition-with
would work like:
(partition-with (fn [x y] (> y (inc x))) [1 2 3 7 8 14 15 16 17 20 21 22])
;; => ((1 2 3) (7 8) (14 15 16 17) (20 21 22))
(partition-with (fn [x y] (> y (inc x))) [1 2 3 8 9 10 99])
;; => ((1 2 3) (8 9 10) (99))
There's an equivalent Ruby function called slice_when
, that provide use cases in their documentation. I translated them to Clojure's equivalents below.
Chunking a one-by-one increasing subsequence:
(->> [1 2 4 9 10 11 12 15 16 19 20 21]
(partition-with (fn [x y] (not= (inc x) y)))
(mapcat (fn [c] (if (< (count c) 3) c [(format "%s-%s" (first c) (last c))])))
(string/join ","))
;; => "1,2,4,9-12,15,16,19-21"
Near elements (threshold: 6) in sorted array can be chunked as follows:
(partition-with (fn [x y] (< 6 (- y x))) [3 11 14 25 28 29 29 41 55 57])
;; => ((3) (11 14) (25 28 29 29) (41) (55 57))
Increasing (non-decreasing) subsequence can be chunked as follows:
(partition-with > [0 9 2 2 3 2 7 5 9 5])
;; => ((0 9) (2 2 3) (2 7) (5 9) (5))
Adjacent evens and odds can be chunked as follows:
(partition-with (fn [x y] (not= (even? x) (even? y))) [7 5 9 2 0 7 9 4 2 0])
;; => ((7 5 9) (2 0) (7 9) (4 2 0))
/edit This can be more succinctly expressed using partition-by
, namely: (partition-by even? [7 5 9 2 0 7 9 4 2 0]) ;; => ((7 5 9) (2 0) (7 9) (4 2 0))
.
Paragraphs (non-empty lines with trailing empty lines) can be chunked as follows:
(->> ["foo\n" "bar\n" "\n" "baz\n" "qux\n"]
(partition-with (fn [l1 l2] (and (re-matches #"\n" l1) (re-matches #".*\n" l2)))))
;; => (("foo\n" "bar\n" "\n") ("baz\n" "qux\n"))
(this one is different to Ruby's example, since it checks a trailing empty line followed by non-empty lines, we would need a partition-after-with
for this.)
Also in Clojurians Slack functions like partition-when
or partition-with
are mentioned a few times.
Note that cgrand, the creator of the xform library, replies on the suggestion of partition-with
:
The thing with
partition-when
is that it deals with delimiters and you then have a lot of choices: should delimiters be returned? (and if yes how? As single items? Sequence? Appended to the previous partition? Prepended to the next partition?) How delimiters runs are treated?
So either a single function with an options map or many functions. Or you find a way to compose it out of other fns.
In the current implementation it splits between where the predicate is true, but before or after would also be a reasonable choice (?) and these two variants exist already in medley (and around would maybe be an addition). For consistency this would imply functions like partition-before-with
and partition-after-with
(and possibly partition-around
and partition-around-with
).
Regarding your last remark:
The previous argument can be passed via an argument to the recursive function used to create the tail of the lazy seq.
In the current implementation it's not the previous argument to the recursive function that creates the tail of the lazy-seq, it's the previous argument passed to take-while
inside the function to create the tail. I'll first give this a shot. :)
As described in earlier comment I investigated the various types of partitioning (before and after). I find it difficult to make an implementation for something like partition-after-with
, because you need to access the element after the current one to know if you should make a cut on the first element (e.g., (sequence (partition-after-with <) [1 2 3 4 5])
should return ''([1] [2] [3] [4] [5])
and not '([1 2] [3] [4] [5])
, but I find this impossible to implement because for the first element you need to know that the element that follows the current will partition or not, and you don't have access to it. Maybe I'm missing something, but because of this difficulty, I think only having a default -before version makes sense.
I tried to make an implementation based on passing the previous item as an argument, but I don't see how this can be accomplished in a performant manner. If you have a suggestion on how to implement this let me know. I am curious to see.
If you think this is al not viable this PR can be closed.
I think partition-after-with
is likely too specific anyway.
With regard to the seq version, I was thinking something like this:
(defn partition-with [pred coll]
(letfn [(take-part [prev coll]
(lazy-seq
(when-let [[x & xs] (seq coll)]
(when-not (pred prev x)
(cons x (take-part x xs))))))]
(lazy-seq
(when-let [[x & xs] (seq coll)]
(let [run (take-part x xs)]
(cons (cons x run) (partition-with pred (drop (count run) xs))))))))
(I haven't thoroughly tested this, though.)
Really clever, thanks. I added it with one change: I wrapped (drop (count run))
in a lazy-seq
.
The reason is the same as in partition-by
, namely CLJ-1764 ("partition-by
counts and thus realizes each current partition to call itself recursively"). Equivalent example:
(def x (partition-with (fn [x y] (zero? x)) (range)))
(first x)
(first (second x)) ;; <- goes in infinite loop without the `lazy-seq`.
;; => 1
The two other lazy-seqs you added around the body of take-part
and after the let
make also the partition lazy, which is an improvement over the previous implementation:
(def res
(m/partition-with (fn [x y] (println [x y]) (> y 100)) (range 200)))
(nth (first res) 5) ; => 0 (only compares up to [4 5])
Besides that it's also slightly faster when measured with (criterium.core/bench (vec (partition-with < (range 1e3))))
.
As a quick note, please ensure the commit message is plaintext, not markdown.
Thanks. I processed your review comments. Code and names were based on dedupe
and partition-by
, but I agree some parts were redundant and that your suggestions are more clear. To not shadow prev
I called the dereffed version p
. Let me know if you prefer the earlier pval
or have another suggestion. I decided against prev-val
or prior
because the line then goes over 80 characters. Or maybe call p
prev
and give the volatile a bigger name?
Regarding adding or not, take your time to mull it over :) I think it's a nice and useful addition. On the other hand in clojure.core I have not seen any other functions that take a pred over a pair of adjacent items. I think only dedupe
and partition-by
compare previous values, but they use a fixed predicate (respectively equals and not-equals). Then again, why not? And partition-before
, partition-after
and partition-with
do nicely line up with Ruby's slice_before
, slice_after
and slice_when
.
NB I tried to change the implementation of the seq version, analogous to dedupe
and dedupe-by
, to ([pred coll] (sequence (partition-with pred) coll))
. Criterium shows it has a roughly 20% better performance, but the laziness is lost. The first partition is always realized and, for example, (ffirst (m/partition-with > (range)))
goes in an infinite loop. So this was not a better implementation. Probably same reason as to why partition-by
is not implemented like that either ((first (my-partition-by zero? (range))) ; infinite loop
).
In case this is helpful for deciding on a good name, the (almost) same function in flatland.useful.seq is called partition-between
https://cljdoc.org/d/org.flatland/useful/0.11.6/api/flatland.useful.seq#partition-between
That is very helpful, thanks! I like partition-between
, as "between" implies two consecutive values, and fits in more naturally with partition-before
and partition-after
. It is a little longer than partition-with
, but I think that's a reasonable compromise.
Thanks, I have changed the name to partition-between
. I needed to add a newline at two places to prevent lines going over 80 characters and shortened two test inputs from [1 2 3 4 5
] to [1 2 3 4]
for the same reason.
This function does look like a superset of the functionality of partition-before
and partition-after
. Meaning we could redefine them in terms of it, performance allowing:
(defn partition-before
([pred] (partition-between (fn [_ x] (pred x))))
([pred coll] (partition-between (fn [_ x] (pred x)) coll)))
(defn partition-after
([pred] (partition-between (fn [x _] (pred x))))
([pred coll] (partition-between (fn [x _] (pred x)) coll)))
/edit It actually seems more performant for partition-before:
medley.core> (let [test-coll (shuffle (range 500))]
(->> (bench/benchmark-round-robin [(dorun (partition-before odd? test-coll))
(dorun (partition-before-new odd? test-coll))]
bench/*default-benchmark-opts*)
(map :mean)))
([9.553932864338904E-5 (9.521377361241475E-5 9.609587795037858E-5)]
[6.526032066012073E-5 (6.506766188578595E-5 6.551461647432628E-5)])
But a little worse (maybe not significant?) for partition-after:
medley.core> (let [test-coll (shuffle (range 500))]
(->> (bench/benchmark-round-robin [(dorun (partition-after odd? test-coll))
(dorun (partition-after-new odd? test-coll))]
bench/*default-benchmark-opts*)
(map :mean)))
([5.4655768579906435E-5 (5.448247826989242E-5 5.486496964666048E-5)]
[6.317321974920002E-5 (6.294439896625525E-5 6.342228236815859E-5)])
This function does look like a superset of the functionality of
partition-before
andpartition-after
. Meaning we could redefine them in terms of it, performance allowing
Yes, I noticed that as well, and I was kicking myself for not thinking of this function when partition-before
and partition-after
were submitted. I may have reconsidered their inclusion if partition-between
came first... though perhaps partition-before
and partition-after
are useful enough to warrant inclusion anyway.
I'm not sure if there are any potential edge conditions that may make them behave differently - I haven't found any, but I haven't extensively looked, either.
With regard to the benchmarks, that's very interesting. How does the transducer version compare? My guess is that the transducer versions for the current implementation should slightly outperform the transducer version of an implementation based on partition-between
.
Correct:
medley.core> (let [test-coll (shuffle (range 500))]
(->> (bench/benchmark-round-robin [(dorun (into [] (partition-before odd?) test-coll))
(dorun (into [] (partition-before-new odd?) test-coll))]
bench/*default-benchmark-opts*)
(map :mean)))
([2.2860320464171494E-5 (2.263882904997551E-5 2.3114705628187397E-5)]
[2.653058937697948E-5 (2.6303527633302964E-5 2.6779686969065904E-5)])
medley.core> (let [test-coll (shuffle (range 500))]
(->> (bench/benchmark-round-robin [(dorun (into [] (partition-after odd?) test-coll))
(dorun (into [] (partition-after-new odd?) test-coll))]
bench/*default-benchmark-opts*)
(map :mean)))
([2.5865390321092607E-5 (2.5597964750243003E-5 2.61791602106295E-5)]
[2.8569139186355313E-5 (2.8261014440476192E-5 2.890015513141026E-5)])
Thanks for merge @weavejester and thanks for insightful comments both you and @tomdl89. Happy to have been able to suggest this contribution and learn from the process.
In, for example, this Ask Clojure question and this Stackoverflow question there's asked for the partitioning of a sequence based on comparing consecutive values.
partition-when
implements this.The implementation is based on Clojure's
partition-by
and Medley'spartition-after
andpartition-before
.Note that also the non-transducer two-arity implementation is stateful to keep it performant and lazy.