well-typed / hs-bindgen

Automatically generate Haskell bindings from C header files

Replace `HsSrcExts` backend with own `Preprocessor` backend #231

Open TravisCardwell opened 2 days ago

TravisCardwell commented 2 days ago

We have decided to stop using the haskell-src-exts package. We have already started using our own AST representation, since haskell-src-exts is missing some needed features. We currently use haskell-src-exts to render the source code when in preprocessor mode, but we found that we have little control over how the code is formatted. For example, Pretty output does not support documentation, and while haskell-src-exts-sc could work for us, it would require implementing our own pretty-printer (fork of Pretty). (See https://github.com/well-typed/hs-bindgen/issues/26#issuecomment-2418245979 for details.) At that point, we may as well implement a pretty-printer for our own AST that supports documentation, as that would likely be much easier to maintain in the long run.

Note that we only have to pretty-print the code that we generate (a subset of Haskell syntax), not develop a pretty-printer that handles all cases. We do not have to parse Haskell code.
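To illustrate how small such a printer can be, here is a minimal sketch of rendering a single documented type signature. The `GenSig` type and `renderSig` function are hypothetical names made up for this example and are not part of hs-bindgen; the comment layout is just one possible choice.

```haskell
-- Hypothetical AST node (not the hs-bindgen AST): a type signature with
-- optional documentation attached.
data GenSig = GenSig
  { sigDoc  :: Maybe String  -- documentation text, possibly multi-line
  , sigName :: String        -- name of the binding
  , sigType :: String        -- already-rendered type, kept as text for brevity
  }

-- Render the signature as output lines, with any documentation emitted as a
-- Haddock comment directly above it.
renderSig :: GenSig -> [String]
renderSig sig = docLines ++ [sigName sig ++ " :: " ++ sigType sig]
  where
    docLines = case fmap lines (sigDoc sig) of
      Just (l : ls) -> ("-- | " ++ l) : map ("--   " ++) ls
      _             -> []  -- no documentation, or empty documentation
```

Because only generated declarations need to be handled, each AST node gets one rendering function like this, and documentation is an ordinary field rather than something bolted onto a general-purpose printer.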

Here are some other options that we considered:

Here are some code formatters for reference/inspiration/discouragement:

Note that the maintained code formatters avoid haskell-src-exts due to various issues.

TravisCardwell commented 2 days ago

It sounds like we will design our AST to use a type family to determine the types of annotations on different parts of the AST in different stages/passes of the translation, similar to the technique described in Trees that Grow.
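For reference, a minimal sketch of that technique might look like the following. The pass names (`Parsed`, `Documented`), the `DeclAnn` family, and the `Decl` type are all hypothetical and only illustrate the idea; the actual passes and annotations have not been decided.

```haskell
{-# LANGUAGE TypeFamilies #-}

import Data.Kind (Type)

-- Hypothetical translation passes, used only as type-level tags.
data Parsed
data Documented

-- The annotation type carried by a declaration depends on the pass.
type family DeclAnn pass :: Type
type instance DeclAnn Parsed     = ()            -- nothing known yet
type instance DeclAnn Documented = Maybe String  -- optional documentation

-- A declaration indexed by pass: annotation, name, rendered type.
data Decl pass = VarDecl (DeclAnn pass) String String

-- Moving from one pass to the next replaces the annotation.
attachDoc :: (String -> Maybe String) -> Decl Parsed -> Decl Documented
attachDoc lookupDoc (VarDecl () name ty) = VarDecl (lookupDoc name) name ty

-- Code that needs documentation can only be applied after that pass.
declDoc :: Decl Documented -> Maybe String
declDoc (VarDecl doc _ _) = doc
```

The same `Decl` shape is shared by every pass, while the type checker prevents, say, reading documentation from a declaration that has not been through the documentation pass.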

One generally writes a formatter that supports all stages/passes, so that it can be used to format error messages in intermediate stages as well as the final code. I am not yet sure what is best in our case, however, as I do not think we have a concrete (documented) plan for the translation passes yet. When we get to this, here are some things to consider:

TravisCardwell commented 2 days ago

We need to decide what to do about formatting rules ("style").

We could provide options that allow users to tweak the style according to their tastes, but doing so would make the formatting implementation more complicated.

We could support different styles by name, which would allow users to select the style that they dislike least.

How important is this? Perhaps we do not need to worry about style options so much? Perhaps users who use/prefer existing formatters can simply run their preferred formatter on the generated code?

Based on what I know at this time, I think it might be preferable to create a single style with no options. That would be the easiest to maintain as we implement features during initial development. We can organize the code to allow for other styles (perhaps with options) that may be implemented in the future.

If we do this, we need to pick a style. We could render the same style that we are writing hs-bindgen in, which I think is Edsko's style (perhaps in HsBindgen.Backend.PP.Render.Edsko). However, I do not know the formatting rules for constructs for which I have no examples. Another option is to render a (similar) style with simplified rules that make it easier to automate (perhaps in HsBindgen.Backend.PP.Render.Simple).
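One way to keep a single style while leaving room for others is to bundle the style-dependent rendering functions in a record, so that the rest of the backend only depends on the record. This is a rough sketch under that assumption; the `Style` record, its fields, and `simpleStyle` are hypothetical, and the leading-comma layout is only an example of a rule that is easy to automate.

```haskell
-- Hypothetical bundle of style-dependent rendering functions. Only one value
-- is defined here; other styles could be added later as further values.
data Style = Style
  { styleName    :: String
  , renderFields :: [(String, String)] -> [String]  -- (field name, type)
  }

-- A simplified style: one field per line, leading commas.
simpleStyle :: Style
simpleStyle = Style
  { styleName    = "simple"
  , renderFields = render
  }
  where
    render []       = ["{}"]
    render (f : fs) = ("{ " ++ field f) : map ((", " ++) . field) fs ++ ["}"]

    field (name, ty) = name ++ " :: " ++ ty
```

With no user-facing options, the only maintenance cost is the one `Style` value, and a second style does not require touching the code that walks the AST.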

TravisCardwell commented 2 days ago

Should we take character widths into account when formatting code?

In particular, the widths of characters determine how code is formatted when using a maximum line length. With Unicode text, the maximum line length is generally specified in columns/cells, not characters. For example, identifiers in Chinese may result in lines that are well over 80 columns even when there are fewer than 80 characters per line.

When comments are aligned alongside code, ignoring differences in character width often results in misalignment. We will not have to worry about this if we put all comments on separate lines.

Note that character width is not specified as a Unicode property, as fonts have leeway. (For example, the reference mark (U+203B) 「※」 is often problematic because it is displayed using one column in some fonts and two columns in other fonts. We will not run across this character, which is not valid in C or Haskell identifiers.) I usually reference Vim for character widths.

Perhaps we do not want to worry about such details now but may revisit them in the future. In that case, we can use a function that gets the width of text with an initial implementation that just returns the number of characters, which can later be changed. The function should be used consistently, avoiding counting characters anywhere else in the code.
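A minimal sketch of that approach follows. The names `textWidth`, `fitsWithin`, and `padTo` are hypothetical; the point is only that every width decision goes through `textWidth`, so that its implementation can be replaced later without touching callers.

```haskell
-- Width of text in columns. Initial implementation: one column per
-- character. This is the only place that should need to change if
-- character widths are taken into account later.
textWidth :: String -> Int
textWidth = length

-- Does a line fit within the maximum line length (in columns)?
fitsWithin :: Int -> String -> Bool
fitsWithin maxCols line = textWidth line <= maxCols

-- Pad text with spaces to the given column, e.g. to align trailing comments.
-- Text that is already wider than the target is returned unchanged.
padTo :: Int -> String -> String
padTo cols s = s ++ replicate (cols - textWidth s) ' '
```

Note that `padTo` and `fitsWithin` never call `length` directly, which is the discipline described above.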