Replace `HsSrcExts` backend with own `Preprocessor` backend

TravisCardwell commented 2 days ago

We have decided to stop using the haskell-src-exts package. We have already started using our own AST representation, since haskell-src-exts is missing some needed features. We currently use haskell-src-exts to render the source code when in preprocessor mode, but we found that we have little control over how the code is formatted. For example, Pretty output does not support documentation, and while haskell-src-exts-sc could work for us, it would require implementing our own pretty-printer (fork of Pretty). (See https://github.com/well-typed/hs-bindgen/issues/26#issuecomment-2418245979 for details.) At that point, we may as well implement a pretty-printer for our own AST that supports documentation, as that would likely be much easier to maintain in the long run.

Note that we only have to pretty-print the code that we generate (a subset of Haskell syntax), not develop a pretty-printer that handles all cases. We do not have to parse Haskell code.
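To illustrate how small such a printer can be when it only covers generated code, here is a minimal sketch using only `base`. The `Decl` and `Field` types and the `renderDoc` and `renderDecl` functions are hypothetical names made up for this example, not the hs-bindgen AST, and the layout shown is only one possible choice.

```haskell
-- Hypothetical miniature AST for illustration; not the hs-bindgen types.
data Field = Field
  { fieldName :: String
  , fieldType :: String
  }

data Decl = Record
  { declDoc    :: Maybe String  -- optional documentation for the declaration
  , declName   :: String
  , declFields :: [Field]
  }

-- Render optional documentation as a Haddock comment, one output line per
-- line of documentation.
renderDoc :: Maybe String -> [String]
renderDoc Nothing    = []
renderDoc (Just doc) = case lines doc of
  []       -> []
  (l : ls) -> ("-- | " ++ l) : map ("--   " ++) ls

-- Render a record declaration as a list of output lines.
renderDecl :: Decl -> [String]
renderDecl (Record doc name fields) =
  renderDoc doc ++ ["data " ++ name ++ " = " ++ name] ++ body
  where
    body = case fields of
      []       -> []
      (f : fs) ->
        ["  { " ++ field f] ++ map (\x -> "  , " ++ field x) fs ++ ["  }"]
    field (Field n t) = n ++ " :: " ++ t
```

Because the printer only needs to handle the constructs that we emit, each new construct is one more case in a function like `renderDecl`, and documentation is handled in one place.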

Here are some other options that we considered:

We could generate code with poor formatting and then fix the formatting with a code formatter. Code formatters tend to be pretty big dependencies, however. Adding support for a new release of GHC would depend on support by the formatter.
ghc-exactprint is an option. It uses GHC types, however, and supporting a range of GHC versions would require a lot of maintenance.

Here are some code formatters for reference/inspiration/discouragement:

ormolu and fourmolu use ghc-lib-parser
brittany (unmaintained) uses ghc-exactprint
hindent uses ghc-lib-parser
hfmt (unmaintained) uses haskell-src-exts
haskell-formatter (deprecated) uses haskell-src-exts
stylish-haskell (which only formats some things, not the whole source) uses ghc-lib-parser or ghc types, depending on build flags and the GHC version

Note that the maintained code formatters avoid haskell-src-exts due to various issues.

TravisCardwell commented 2 days ago

It sounds like we will design our AST to use a type family to determine the types of the annotations on different parts of the AST for different stages/passes of the translation, similar to the technique described in Trees that Grow.
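As a rough sketch of that technique (not a proposal for the actual design), the annotation type can be selected by a pass index through an open type family. The pass names `Parsed` and `Translated`, the family `Ann`, and the choice of annotation types are all assumptions made up for this example.

```haskell
{-# LANGUAGE TypeFamilies #-}

import Data.Kind (Type)

-- Hypothetical pass tags; the real passes are not yet decided.
data Parsed
data Translated

-- The annotation type attached to a declaration depends on the pass.
type family Ann pass :: Type
type instance Ann Parsed     = Maybe String  -- assumed: raw comment text
type instance Ann Translated = [String]      -- assumed: documentation lines

-- A declaration indexed by the pass that produced it.
data Decl pass = Decl
  { declAnn  :: Ann pass
  , declName :: String
  }

-- Moving between passes changes the annotation type.
translate :: Decl Parsed -> Decl Translated
translate (Decl ann name) = Decl (maybe [] lines ann) name

-- A formatter that only supports the final representation.
renderDecl :: Decl Translated -> [String]
renderDecl (Decl docs name) = map ("-- " ++) docs ++ ["data " ++ name]
```

A formatter that supports every pass would instead need a constraint or class that says how to render `Ann pass`, which is part of the trade-off raised in the questions in this comment.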

One generally writes a formatter that supports all stages/passes, so that it can be used to format error messages in intermediate stages as well as the final code. I am not yet sure what is best in our case, however, as I do not think we have a concrete (documented) plan for the translation passes yet. When we get to this, here are some things to consider:

Should we support formatting the AST for all passes, or will it be sufficient to only format the final representation?
Will the representation of the initial pass be annotated with documentation that has already been translated to Haddock?

TravisCardwell commented 2 days ago

We need to decide what to do about formatting rules ("style").

We could provide options that allow users to tweak the style according to their tastes, but doing so would make the formatting implementation more complicated.

We could support different styles by name, which would allow users to select the style that they dislike least.

How important is this? Perhaps we do not need to worry about style options so much? Perhaps users who use/prefer existing formatters can simply run their preferred formatter on the generated code?

Based on what I know at this time, I think it might be preferable to create a single style with no options. That would be the easiest to maintain as we implement features during initial development. We can organize the code to allow for other styles (perhaps with options) that may be implemented in the future.

If we do this, we need to pick a style. We could render the same style that we are writing hs-bindgen in, which I think is Edsko's style (perhaps in HsBindgen.Backend.PP.Render.Edsko). I do not know the formatting rules for constructs for which I do not have examples, however. Another option is to render a (similar) style with simplified rules that make it easier to automate (perhaps in HsBindgen.Backend.PP.Render.Simple).
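One way to start with a single style while leaving room for others is to pass a style name to the renderer from the beginning, even though only one constructor exists. This is only a sketch: the `Style` type, `parseStyle`, and `renderSig` are hypothetical names, and the one-line signature layout is an arbitrary placeholder rather than a description of either style mentioned above.

```haskell
import Data.List (intercalate)

-- Hypothetical set of named styles; only one is implemented initially.
data Style = Simple
  deriving (Show, Eq, Enum, Bounded)

-- Select a style by name, for example from a command-line option.
parseStyle :: String -> Maybe Style
parseStyle name = lookup name [(show s, s) | s <- [minBound .. maxBound]]

-- Render a type signature; each style gets its own case.
renderSig :: Style -> String -> [String] -> String
renderSig Simple name types = name ++ " :: " ++ intercalate " -> " types
```

Adding a style later would mean adding a constructor and the corresponding cases, and the compiler's incomplete-pattern warnings would point at every rendering function that still needs one.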

TravisCardwell commented 2 days ago

Should we take character widths into account when formatting code?

In particular, the widths of characters determine how code is formatted when using a maximum line length. With Unicode text, the maximum line length is generally specified in columns/cells, not characters. For example, identifiers in Chinese may result in lines that are well over 80 columns even when there are fewer than 80 characters per line.

When comments are aligned alongside code, ignoring differences in character width often results in misalignment. We will not have to worry about this if we put all comments on separate lines.

Note that character width is not specified as a Unicode property, as fonts have leeway. (For example, the reference mark (U+203B) 「※」 is often problematic because it is displayed using one column in some fonts and two columns in other fonts. We will not run across this character, which is not valid in C or Haskell identifiers.) I usually reference Vim.

Perhaps we do not want to worry about such details now but may revisit them in the future. In that case, we can use a function that gets the width of text with an initial implementation that just returns the number of characters, which can later be changed. The function should be used consistently, avoiding counting characters anywhere else in the code.
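A minimal sketch of that approach follows. The names `textWidth`, `fitsWithin`, and `padTo` are hypothetical. The initial `textWidth` counts characters, so it is only correct for text where every character occupies one column.

```haskell
-- Width of text in columns. Initial implementation: one column per
-- character. This is the only place that would change if we later take
-- character widths into account.
textWidth :: String -> Int
textWidth = length

-- Check whether a line fits within a maximum line length, in columns.
fitsWithin :: Int -> String -> Bool
fitsWithin maxCols line = textWidth line <= maxCols

-- Pad text with spaces to the given column, for example to align comments.
-- Text that is already wider is returned unchanged.
padTo :: Int -> String -> String
padTo cols s = s ++ replicate (cols - textWidth s) ' '
```

Since `fitsWithin` and `padTo` go through `textWidth` instead of calling `length` directly, replacing the implementation later would not require changes elsewhere.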

well-typed / hs-bindgen

Replace `HsSrcExts` backend with own `Preprocessor` backend #231