weavejester / cljfmt

A tool for formatting Clojure code
Eclipse Public License 1.0
1.11k stars 120 forks source link

Add namespace sorting #251

Closed or closed 2 years ago

or commented 2 years ago

An attempt to add namespace sorting in ns reference blocks. This is one of the features requested in #13.

I saw it was previously implemented in #95 and #113, but both PRs appear stale and neither looks like they handle the cases I'm interested in. Most importantly handling comments and all libspec variants (e.g. a.b.c, [a.b.c ...], "npm-dependency") and all flavours of reference blocks (i.e. :require, :require-macros, :use, :import). This implementation tries to address those issues.

It tries to preserve newlines and spaces and will honour these styles:

(:require [foo] [bar])
(:require [foo]
          [bar])
(:require
  [foo]
  [bar])

It'll also move comments before a line and at the end of a line with the libspec element, e.g.

(:require
 ; comment for c
 c
 a ; comment for a
 b)
->
(:require
 a ; comment for a
 b
 ; comment for c
 c)

A messy mix of newlines and same line libspecs and comments and non-indentation whitespaces in the source file can still result in unexpected results, especially if no other options for indentation and whitespace removal are turned on, but I think those should be rare, and at that point it's really hard to guess what might be the desired formating.

I introduced a new namespace, which might not be what you want, but it seemed like those functions would enlarge cljfmt.core quite a bit, and since there were other ideas for ns-related improvements in #13... perhaps that's a place where they could live.

Finally, all but the last commit are preparation and (I hope) small improvements. They are orthogonal to the feature, so either can be cherry-picked or dropped.

Note: This code also assumes ns-form? works correctly to identify the top-level ns form, as #250 is already working on that I use the function as is.

atomist[bot] commented 2 years ago

Commit messages of this repository should follow the seven rules of a great Git commit message, as mentioned in the project's contributing guidelines. It looks like there's a few issues with the commit messages in this pull request:

atomist[bot] commented 2 years ago

Commit messages of this repository should follow the seven rules of a great Git commit message, as mentioned in the project's contributing guidelines. It looks like there's a few issues with the commit messages in this pull request:

weavejester commented 2 years ago

Thanks for the PR. Can you first remove all extraneous commits and changes that are not relevant to the feature being implemented, as per the contributing guidelines. It makes code review much more difficult.

or commented 2 years ago

Done. I'll split the others into a separate PR, if that's ok.

weavejester commented 2 years ago

This looks rather complex for its stated purpose. My first thought is that we'd want functions more like:

(defn- ns-reference? [zloc]
  ...)

(defn- partition-children [zloc p?]
  ...)

(defn- sort-children [zloc p? comparator]
  ...)

If I have time later I'll see if I can put something together that's a little more detailed.

or commented 2 years ago

I'm not sure what predicates you have in mind for the partitioning and sorting, but I understand the idea of a more general sort-children that is called with a partition-fn and key-fn. I've simplified the partition and assembly a bit by removing the special treatment of the first element (:require) and simply giving reference keywords a sort key of [0 value]. That allows a structure similar to what you outlined.

Furthermore, defs and defns are private now where it makes sense, their doc strings are now comments. I adjusted the naming, based on your feedback.

I also added support for reader macros, they now always come first inside the reference form, and multiple ones (which should be rare) are ordered by their string representation.

weavejester commented 2 years ago

I've investigated the problem a little. I think the design you currently have is not something I'd want to merge and maintain. Rather, I was considering a design more like:

(def ^:private ns-reference-symbols
  #{:import :require :require-macros :use})

(defn- ns-reference? [zloc]
  (and (z/list? zloc)
       (some-> zloc z/up ns-form?)
       (-> zloc z/sexpr first ns-reference-symbols)))

(defn- update-children [zloc f]
  (z/replace zloc (n/replace-children (z/node zloc) (f (z/down zloc)))))

(defn- partition-form [zloc f]
  (->> (iterate z/right* zloc)
       (take-while some?)
       (partition-by f)
       (map (partial map z/node))))

(defn- semantic-boundary? [zloc]
  ;; some logic to work out how to group comments with elements
  )

(defn- sort-arguments-by [zloc f]
  (let [[head & args] (partition-form zloc semantic-boundary?)]
    (->> args (sort-by f) (cons head) (apply concat))))

(defn- sort-references [zloc]
  (update-children zloc #(sort-arguments-by % n/string)))

(defn sort-ns-references [form]
  (transform form edit-all ns-reference? sort-references))

This is currently incomplete and only a rough sketch. I need to consider further on how to intelligently sort elements while maintaining spacing but bringing along comments. Possibly some preprocessing is required to add dummy nodes after comments, so that comments can be rearranged without affecting newlines.

or commented 2 years ago

Cool, the refactoring into multiple functions certainly helps. I've changed the current draft to be close to that outline and moved it all into core again. Also renamed it to sort-ns-references.

It's mostly the the partition function that is different now (and the sort key). I like your approach there, but I don't see a good way to do the split with partition-by. I also had to introduce a whitespace or newline node for the last element, so it can be plugged into the middle somewhere. Would be awesome if you find a way where that isn't necessary.

or commented 2 years ago

I had another go at it, now using a partition-at (which uses partition-by) with a predicate that determines whether a zloc starts another libspec element:

(defn- new-libspec-with-padding? [zloc]
  (or (first-line-break-after-libspec? (z/left* zloc))
      (libspec-without-line-breaks-since-last-libspec? zloc)))

Postponing z/node until after the sorting allows using zloc based predicates and movement in reference-sort-key, which also simplifies some things.

Some of the naming still feels a bit verbose, and ensure-final-whitespace and dedupe-whitespace-nodes remain unwieldy, but I haven't found a way to get rid of them.

weavejester commented 2 years ago

It's better, but I think the design still needs some consideration. My thought is that we want to determine the significant groups in a form that can be reordered, while keeping every other part of the form the same. So for example:

(:require
 ; comment for c
 c
 a ; comment for a
 b)

The groups would be:

(:require
 X
 Y
 Z)

Where:

X="; comment for c
c"
Y="a ; comment for a"
Z="b"

So we'd need to determine significant vs insignificant; given a particular zloc, is it part of the sort or not? I'll see if I can find time to implement it this week.

or commented 2 years ago

I see your point, but I think that complicates matters a lot for what seems to be little gain.

Just to make sure I understand it correctly, I'm thinking of a case like this:

(:require
  c
    a
      b)

which in my version would become

(:require
    a
      b
  c)

But with your approach of interleaving it with existing indentation, it should become

(:require
  a
    b
      c)

That makes sense to me and I explored something like that as well, but I ran into problems, especially with comments, because usually their own indentation remains untouched, and it's not obvious how the indentation between comment and libspec should be treated/changed. I think you hinted at that with your example of X, which removed the indentation of c, presumably with the intention of matching its indentation to that of its comment?

I'm all for it, if you can improve it, but let me outline my thoughts and some examples I already pondered, perhaps it'll affect your view on the matter.

The biggest "normal" case:

(:require
 b
    ; comment for a 
 a
 c)

I'd expect this to be reformatted to:

(:require
    ; comment for a 
 a
 b
 c)

but I suspect with your approach it would become something like

(:require
 ; comment for a 
 a
    b
 c)

A similar problem comes up when the libspec indentation steps out of line:

(:require
 b
 ; comment for a 
    a
 c)

in my mind should become

(:require
 ; comment for a 
    a
 b
 c)

but with interleaving would probably have to reset a back to the comment indentation? Maybe there could be some clever thing that maintains the relative indentation to the comment, but how would that work if the offset is negative?

I can also imagine someone using indentation to group some requires, perhaps something like:

(:require
 b
 [a]
   [a.foo.a]
   [a.foo.b]
   [a.foo.c])

I think that should become

(:require
 [a]
   [a.foo.a]
   [a.foo.b]
   [a.foo.c]
 b)

but with interleaving would strangely become

(:require
 [a]
 [a.foo.a]
   [a.foo.b]
   [a.foo.c]
   b)

That's why I designed based on the assumption that it more often is correct to treat the indentation of a libspec line (and its surrounding comments) as part of that libspec.

All that being said: I think we can agree that these are all edge cases, unless you have some specific common situations in mind that I haven't considered. I expect these lines to generally have the same indentation (with comments being special sometimes). cljfmt itself will correct the indentation if it is allowed to do, and I suspect that to be the "usual" case. And if I'm not mistaken in those cases both approaches should always yield the same result.

Therefore it seems to me the interleaving adds quite a bit of complexity for a minority of edge cases where it would improve the situation, while there are (I'm guessing) more common situations where it would likely result in unexpected/undesired results.

or commented 2 years ago

Did you find some time to look at this?

weavejester commented 2 years ago

I spent a couple of hours on it, but I wasn't able to come up with a solution I was satisfied with. I'll take another crack at it this week or next.

or commented 2 years ago

Any joy?

weavejester commented 2 years ago

I've thought about it a little, but I haven't invested a lot of time in this yet. This feels like a regular expression problem (in the general sense), as we want to group the following tokens:

(horizonal-space* comment)* horizonal-space* element (comment | newline)?

The question is whether we use something like Spec or another data regex library, or whether we write something out manually.

Alternatively, we assign each token type a unique character, concatenate them into a string, and take advantage of the inbuilt text regex support to concisely work out the group indices.

weavejester commented 2 years ago

So perhaps we start with something like:

(defn re-seq-matcher [re charmap coll]
  {:pre (every? charmap coll)}
  (let [s (apply str (map charmap coll))
        v (vec coll)
        m (re-matcher re s)]
      (letfn [(next-match []
                (when (.find m)
                  {:subvec (subvec v (.start m) (.end m))
                   :start  (.start m)
                   :end    (.end m)}))]
        (take-while some? (repeatedly next-match)))))

Then use that to find the slices of comments/elements we want to treat as a continuous block. We sort according to the string value of the non-whitespace non-comment elements, then replace the existing slices with their sorted equivalent.

We probably also need to add newlines after every comment before the sort, then remove them afterwards. This would ensure newlines are preserved once the comments are moved around.

I'll see if I can come up with a more concrete implementation.

or commented 2 years ago

Cool, I can see how that'd work, but I personally don't see the benefit yet. It seems like it'll be more complicated, without solving a concrete problem. However, I might be wrong on that, so if you can make it work, then I'm all for it.

I also don't think the string value of the non-comment element will work well for sorting. That might be a vector starting with spaces, for instance. I think the sorting in the PR makes sense: put reader macros first, then strings, then symbols either on their own or as first element in vectors, e.g. "[a], b, [c.foo]" rather than grouping them by vector and symbol, maybe then strings for whatever else flew by those nets.

weavejester commented 2 years ago

It seems like it'll be more complicated, without solving a concrete problem.

The idea is that it will be less complex, more concise, more maintainable, and easier to adapt to similar problems, such as sorting maps or sets.

I've created a draft implementation, but I've yet to thoroughly test it.

In this implementation, elements of a form are extracted with comments on the same and any preceding lines, sorted alphanumerically (ignoring whitespace and comments), then spliced back into the original form.

weavejester commented 2 years ago

I've opened a branch for this: https://github.com/weavejester/cljfmt/tree/ns-sorting

I'm currently undecided on the sorting order. I think you might be right about needing more sophisticated sorting, though there may be a way to achieve the same effect more concisely.

or commented 2 years ago

Nice. The regex needs to match more cases: there might be only spaces between libspecs. Or a libspec followed by spaces and a newline. uneval nodes also are treated like a libspec right now, I'd add them to the :comment case. Even tho it is true that it might be a commented out libspec, where sorting would make sense, if it's commented out with ;;, then the behaviour would be different. Unlike comments there might be multiple of them on the same line, tho.

Also need a re-matcher for ClojureScript, but that probably isn't a huge hurdle.

weavejester commented 2 years ago

The regex needs to match more cases: there might be only spaces between libspecs. Or a libspec followed by spaces and a newline.

The regex should handle those instances.

uneval nodes also are treated like a libspec right now, I'd add them to the :comment case. Even tho it is true that it might be a commented out libspec, where sorting would make sense, if it's commented out with ;;, then the behaviour would be different.

Since uneval nodes aren't going to be written in natural language, I don't think it makes sense to treat them as comments in the traditional sense.

Also need a re-matcher for ClojureScript, but that probably isn't a huge hurdle.

An excellent point. Thanks for reminding me.

or commented 2 years ago

What's left to do here to make it pass the tests and get it shipshape? Anything I can help with?

weavejester commented 2 years ago

The two things left to do are to get it working with ClojureScript, and to tweak the sorting order. I have some ideas on the latter, but I've yet to find time to work on this.

atomist[bot] commented 2 years ago

Commit messages of this repository should follow the seven rules of a great Git commit message, as mentioned in the project's contributing guidelines. It looks like there's a few issues with the commit messages in this pull request:

or commented 2 years ago

I've pulled your new design in here, merging it with the docs update.

I've also merged in my previous tests, showing a few that fail now. On top of that I put several proposal commits intended for fixup:

  1. run the sorting as the first step, so subsequent processing has a chance to fix spacing and whatnot
  2. treat :uneval as :comment, which addresses one of the failing tests - it's optional, I guess an argument could be made that uneval nodes are more likely to be libspecs that were commented out, but it feels more consistent to treat them like comments

I've also added a draft of a cljs version of re-seq-matcher, but, thinking that duplicates too much logic, changed that to a matcher object that hides away the clj/cljs implementation details. See whether you like any of these approaches, perhaps as a starting point.

weavejester commented 2 years ago

Thanks for the code. Hiding the implentation is a good idea, but I don't think we need a protocol for it, as there's no open polymorphism. Instead, what about something like:

(defn- re-indexes [re s]
  (let [matcher    #?(:clj  (re-matcher re s)
                      :cljs (js/RegExp. (.-source re) "g"))
        next-match #?(:clj  #(when (.find matcher)
                               [(.start matcher) (.end matcher)])
                      :cljs #(when-let [result (.exec matcher s)]
                               [(.-index result) (.-lastIndex matcher)]))]
    (take-while some? (repeatedly next-match))))

(defn- re-seq-matcher [re charmap coll]
  {:pre (every? charmap coll)}
  (let [s (apply str (map charmap coll))
        v (vec coll)]
    (for [[start end] (re-indexes re s)]
      {:value (subvec v start end)
       :start start
       :end   end})))

Running the namespace sort first rather than last also makes a lot of sense.

or commented 2 years ago

Great, that's much better.

or commented 2 years ago

Did you have a chance to look at the sorting?

weavejester commented 2 years ago

Apologies for the delay. My current thought is to sort by:

(defn- node-sort-string [nodes]
  (-> (remove (some-fn n/comment? n/whitespace?) nodes)
      (nodes-string)
      (str/replace #"[\[\]\(\)\{\}]|\^[:\w-]+" "")
      (str/trim)))

In other words: ignore brackets, metadata, and leading spaces, then sort alphanumerically. This will keep conditionals together, strings together, and also sort namespaces in an intuitive order (e.g. a comes before [b]).

It's a little crude, so I'd appreciate any edge cases you can think of that might cause issues.

or commented 2 years ago

I think that's hard to grok and won't be sufficient to deal with metadata that's map-shaped, I've added a test for that:

(is (reformats-to?
       ["(ns foo.bar"
        "  (:require"
        "   ^{:foo :bar"
        "     :bar :foo}"
        "   [c]"
        "   [a.b :as b]"
        "   b))"]
       ["(ns foo.bar"
        "  (:require"
        "   [a.b :as b]"
        "   b"
        "   ^{:foo :bar"
        "     :bar :foo}"
        "   [c]))"]
       {:sort-ns-references? true}))

The result is

(ns foo.bar
  (:require
   ^{:foo :bar
     :bar :foo}
   [c]
   [a.b :as b]
   b))

Because ^:bar is the string it sorts on after removal of {.

I want to pitch again to have a sorting function that makes the order explicit. It could still use the removal of any irrelevant nodes and use nodes-string, especially as secondary order criterion, but explicitly unwrap metadata nodes, put :uneval on top, followed by string, and then everything else.

or commented 2 years ago

The two remaining tests that fail are due to a mix of newlines and libspecs on the same line, potentially without spaces, which then result in strange indentation and reassembly due to splicing. Please have a look whether you agree with the tests, otherwise the behaviour might be acceptable to you.

Is there really a big advantage in the splicing? I think this regex approach could work well with a pattern that includes all whitespace (and comments) to the left and a newline, if it exists, to the right, so those move with the libspec.

There's a potential change of newline after the last libspec, but that seems like an edge case that can be ignored, since other options take care of it (at least when a newline should be removed). And the more important behaviour is that all indentation and newlines being kept as they are, if the order doesn't change, which is the case.

or commented 2 years ago

One more update: I had to change the regex to allow for more special cases with uneval blocks above the libspec: #"((CS*)+NS*)*E(S*C)*", also added a test for that:

(is (reformats-to?
       ["(ns foo.bar"
        "  (:require"
        "   [foo]"
        "   ;; comment above"
        "   #_[comment above] #_another"
        "   ;; comment above"
        "   [b]))"]
       ["(ns foo.bar"
        "  (:require"
        "   ;; comment above"
        "   #_[comment above] #_another"
        "   ;; comment above"
        "   [b]"
        "   [foo]))"]
       {:sort-ns-references? true}))
weavejester commented 2 years ago

I think that's hard to grok and won't be sufficient to deal with metadata that's map-shaped

Yes, that's true. I was hoping to get away with a regular expression, but I probably want to map over the nodes and remove metadata that way.

I want to pitch again to have a sorting function that makes the order explicit.

I'd rather have something that's easy to remember. "Sorted alphanumerically, ignoring metadata and brackets" is simpler than trying to remember an arbitrary order.

I think its important that functions behave in a way that's predictable and can be concisely explained and recalled. We want to boil down the behaviour into a minimal set of rules.

The two remaining tests that fail are due to a mix of newlines and libspecs on the same line, potentially without spaces, which then result in strange indentation and reassembly due to splicing. Please have a look whether you agree with the tests, otherwise the behaviour might be acceptable to you.

I think that's fine. This sort of edge case is almost always going to be in error, and could potentially be solved by an :indent-comments? rule that would have also have more general use.

Is there really a big advantage in the splicing? I think this regex approach could work well with a pattern that includes all whitespace (and comments) to the left and a newline, if it exists, to the right, so those move with the libspec.

That potentially results in odd behaviour for the first and last elements, and I feel its generally unintuitive behaviour. Ideally we want to start from the simplest possible behaviour, i.e. "it sorts non-whitespace elements", and then add the minimum amount of complexity needed, i.e. "it sorts non-whitespace elements, keeping the relative position of comments above and to the right of each element".

I had to change the regex to allow for more special cases with uneval blocks above the libspec

I think my preference is not to treat uneval blocks the same as comments in this case, as they fulfil different roles. The purpose of a comment is to describe code, while an uneval block is code we don't want to evaluate.

or commented 2 years ago

I think my preference is not to treat uneval blocks the same as comments in this case, as they fulfil different roles. The purpose of a comment is to describe code, while an uneval block is code we don't want to evaluate.

Ok, then let's treat uneval as non-comment. I can see a situation where a libspec is unevaled, so it'd make sense to sort that as well. It would have to be unpacked like the metadata.

I'd rather have something that's easy to remember. "Sorted alphanumerically, ignoring metadata and brackets" is simpler than trying to remember an arbitrary order.

I think its important that functions behave in a way that's predictable and can be concisely explained and recalled. We want to boil down the behaviour into a minimal set of rules.

I understand that and can live with it, but it's still hiding a lot of implicit stuff, I think. For instance, the reader-macros only end up on top because there's a keyword in there that starts with :, if I understand it correctly. Similar with the " of strings. The code doesn't really reveal that.

That potentially results in odd behaviour for the first and last elements, and I feel its generally unintuitive behaviour. Ideally we want to start from the simplest possible behaviour, i.e. "it sorts non-whitespace elements", and then add the minimum amount of complexity needed, i.e. "it sorts non-whitespace elements, keeping the relative position of comments above and to the right of each element".

I now think it might be simpler if cljfmt dictates how the form should look (setting linebreaks and indentation), rather than working so hard to preserve indentation and special cases. But maybe this approach is more re-usable.

Those two concerns aside, I've pushed a new version that removes the uneval changes, adds a test for an unevaled libspec that is expected to be sorted as well, changed/removed all the tests that had super edgy splicing problem cases, e.g. [foo]c, and drafted a function that unpacks metadata and uneval as PoC:

(defn- unpack-metadata-and-uneval [nodes]
  (mapcat (fn [node]
            (case (n/tag node)
              :meta (rest (n/children node))
              :uneval (n/children node)
              [node])) nodes))

(defn- node-sort-string [nodes]
  (-> (remove (some-fn n/comment? n/whitespace?) (unpack-metadata-and-uneval nodes))
      (nodes-string)
      (str/replace #"[\[\]\(\)\{\}]" "")
      (str/trim)))

Feel free to change it as needed. All tests pass now, and I think it's starting to look like a decent solution.

weavejester commented 2 years ago

I can see a situation where a libspec is unevaled, so it'd make sense to sort that as well. It would have to be unpacked like the metadata.

Maybe. I'm in two minds about this. The downside is that the description of the sorting behaviour becomes less concise:

Elements are sorted alphanumerically, ignoring brackets, metadata, and uneval prefixes.

Vs.

Elements are sorted alphanumerically, ignoring brackets and metadata.

Putting in a special case for brackets and metadata makes sense, but I'm unsure if uneval is important enough to warrant making the behaviour more complex. I have a slight inclination to default to the simpler option.

I understand that and can live with it, but it's still hiding a lot of implicit stuff, I think. For instance, the reader-macros only end up on top because there's a keyword in there that starts with :, if I understand it correctly.

Almost: it's because they start with #?.

Similar with the " of strings. The code doesn't really reveal that.

Implicit is fine, as long as its predictable. If I sort alphanumerically, items that start with the same character are grouped together. That might be implicit, but it shouldn't be surprising.

I now think it might be simpler if cljfmt dictates how the form should look (setting linebreaks and indentation), rather than working so hard to preserve indentation and special cases.

So my intent here is simplicity, in the Rich Hickey sense of the word. It might harder to sort while leaving the spacing alone, but it's simpler because there a single purpose to this system. It just sorts. It doesn't make spacing or indentation changes; it only reorders the significant elements in a form.

But maybe this approach is more re-usable.

I can see this being useful for sorting literal maps and sets as well, which may potentially be useful.

or commented 2 years ago

Putting in a special case for brackets and metadata makes sense, but I'm unsure if uneval is important enough to warrant making the behaviour more complex. I have a slight inclination to default to the simpler option.

It's worth considering that using #_ for a libspec or later map key/value pairs or set elements, perhaps only temporarily, would make these lines jump to the top of the form when the editor uses cljfmt on save. In my opinion the intuitive/useful behaviour would be that it behaves similar to metadata.

However, that could be revisited when it is actually needed for map literals, where unevals probably are more likely. And then there's a whole new issue with stacked #_s, which the current version can't handle all too well.

It's not a blocker for me either way, feel free to remove it.