Include Haddocks for exported (low-level) bindings

edsko commented 3 months ago

When we generate foreign imports from C headers, if the declarations in those headers have documentation attached, then we should include that documentation in the generated bindings also. Ideally we would be using Haddock formatting here, but given the absence of a general agreed standard for the documentation of C headers, that might be hard to do in general. Nonetheless, we could at least support common formats.

[ ] Translate Doxygen style.
[ ] As mentioned in #25, it might also be useful to include C variable names in as part of the Haddocks.
[ ] If we generate an export list (#76), and if the C header uses Doxygen sections, it would be nice to use those same sections also in the export list.

TravisCardwell commented 1 month ago

I defined the wrappers in doxygen_wrappers.h as well as the corresponding Haskell API in HsBindgen.Clang.Doxygen and the necessary enumerations in HsBindgen.Clang.Doxygen.Enums, in the doxygen branch.

Of the content in clang-c/Documentation.h, the following is not (yet) implemented:

Struct CXComment attributes ASTNode and TranslationUnit are not implemented. I think that we will traverse the AST top-down, so I doubt we will need these.
Typedef CXAPISet, as well as related functions clang_createAPISet and clang_disposeAPISet, are not implemented. I do not think that we will need these for our purposes.
Functions clang_getSymbolGraphForUSR and clang_getSymbolGraphForCursor are not implemented. I do not think that we will need these for our purposes.

Please feel free to let me know if we will need something that I have not implemented, though I will no doubt find out when I proceed with implementation.

TravisCardwell commented 1 month ago

I am currently implementing an ast-dump utility that walks the AST and dumps information so that I can see exactly how the AST is structured. The LLVM tool works fine, of course; I want to see details via our library.

Just to get started quickly, I started implementing it as a hs-bindgen-libclang test suite, like the tutorial. It could be useful in the long run, though. Perhaps it should be put in a separate package, so that more dependencies can be used. For example, it would probably be worthwhile to implement a CLI parser using optparse-applicative to provide options to select what information is displayed. I am being very verbose with documentation output since that is what I am working on, but such information would get in the way when working on something else.

BTW, output is currently formatted in Markdown

I have not committed the code yet, but I should be able to do so tomorrow.

TravisCardwell commented 1 month ago

I wonder if it is worth adding HsBindgen.Clang.Util.Classification predicates for the documentation API.

edsko commented 1 month ago

Would the existing show-clang-ast command help with this? Or do you need more info than it provides?

TravisCardwell commented 1 month ago

Would the existing show-clang-ast command help with this? Or do you need more info than it provides?

I hadn't seen that yet! Nice!

Currently, I am inspecting the parsed comments in detail.

(WIP) Example

``` * "S4" * cursor type kind: Right CXType_Record * spelling: "Record" * record: "struct S4" * extent start: (4,1) * extent end: (18,2) * comment: "/**\n * A struct with a Doxygen comment\n */" * brief: "A struct with a Doxygen comment" * kind: Right CXComment_FullComment * children: 1 * kind: Right CXComment_Paragraph * children: 1 * kind: Right CXComment_Text * text: " A struct with a Doxygen comment" * "a" * cursor type kind: Right CXType_Char_S * spelling: "Char_S" * extent start: (8,5) * extent end: (8,11) * comment: "/**\n * A field preceded by a Doxygen comment\n */" * brief: "A field preceded by a Doxygen comment" * kind: Right CXComment_FullComment * children: 1 * kind: Right CXComment_Paragraph * children: 1 * kind: Right CXComment_Text * text: " A field preceded by a Doxygen comment" * "b" * cursor type kind: Right CXType_Int * spelling: "Int" * extent start: (10,5) * extent end: (10,10) * comment: "/**< A field followed by a Doxygen comment */" * brief: "A field followed by a Doxygen comment" * kind: Right CXComment_FullComment * children: 1 * kind: Right CXComment_Paragraph * children: 1 * kind: Right CXComment_Text * text: " A field followed by a Doxygen comment " * "c" * cursor type kind: Right CXType_Float * spelling: "Float" * extent start: (17,5) * extent end: (17,12) * comment: "/**\n * A field that refers to another field\n *\n * See also @ref S4::a\n */" * brief: "A field that refers to another field" * kind: Right CXComment_FullComment * children: 2 * kind: Right CXComment_Paragraph * children: 1 * kind: Right CXComment_Text * text: " A field that refers to another field" * kind: Right CXComment_Paragraph * children: 3 * kind: Right CXComment_Text * text: " See also " * kind: Right CXComment_InlineCommand * name: "ref" * render kind: Right CXCommentInlineCommandRenderKind_Normal * args: 1 * 0: "S4::a" * kind: Right CXComment_Text * whitespace: True * text: " " ```

TravisCardwell commented 1 month ago

I finished the initial version of the Clang AST dump utility, now named clang-ast-dump. I put it in a package of the same name, in the doxygen branch. This package is (tentatively) temporary, just used for current development. It can be enabled by creating a symbolic link, as described in the README.

The purpose of the utility is to dump (detailed) information about the Clang AST. The existing show-clang-est command traverses the Haskell types that we create from the Clang data structures, while this utility dumps information from the Clang data structures themselves. Elements are output without hierarchy; the dumped information is displayed hierarchically, in Markdown format.

Various options configure what is dumped. Notably, --same-file can be used to only dump things from the specified file, not things included by that file. Other boolean options configure what types of details are displayed.

Currently, the utility only recurses on the same (four) cursor kinds as HsBindgen.C.Parser.foldDecls. Other cursor kinds output CURSOR_KIND_NOT_IMPLEMENTED; if we see that in the output of something that we are investigating, we know more information may be available.

The comment dumping is complete, since that is what I am working on.

I am now using this utility to investigate how real-world Doxygen comments are parsed by Clang. I will build up a plan for how to translate these to Haddock comments. I am starting with clang-c/Index.h and clang-c/Documentation.h.

edsko commented 1 month ago

Let's discuss this in our call today.

TravisCardwell commented 1 month ago

During the call yesterday, I mentioned that I saw some unhelpful (empty) failures when experimenting with recursing on some more cursor kinds. The failure is not an issue at this time, as we do not support those cursor kinds yet, but we might want to change how we raise the exception so that it is more informative.

I remember that CXCursor_FunctionDecl caused this issue. Here is the error, which has an empty hint.

clang-ast-dump: CallFailed "" CallStack (from HasCallStack):
  collectBacktrace, called at src/HsBindgen/Clang/Internal/Results.hs:41:14 in hs-bindgen-libclang-0.1.0-inplace:HsBindgen.Clang.Internal.Results
  callFailed, called at src/HsBindgen/Clang/Internal/Results.hs:68:12 in hs-bindgen-libclang-0.1.0-inplace:HsBindgen.Clang.Internal.Results
  ensureOn, called at src/HsBindgen/Clang/Internal/Results.hs:54:10 in hs-bindgen-libclang-0.1.0-inplace:HsBindgen.Clang.Internal.Results
  ensure, called at src/HsBindgen/Clang/Core.hs:1055:29 in hs-bindgen-libclang-0.1.0-inplace:HsBindgen.Clang.Core
  clang_getTypeSpelling, called at Main.hs:74:32 in clang-ast-dump-0.0.0.0-inplace-clang-ast-dump:Main

I could link to the code, but those links will very quickly be invalid when I rebase this morning, so I will copy-and-paste.

The above failure occurred when recursing on the clang_Cursor_getParsedComment(CXCursor) function of clang-c/Documentation.h. The cursor is for the default attribute, with cursor kind simpleEnum CXCursor_VisibilityAttr.

* "default"
    * parent: "clang_Cursor_getParsedComment(CXCursor)"
    * extent: ("/usr/include/clang-c/Documentation.h",(47,1),(47,15))
    * cursor kind: simpleEnum CXCursor_VisibilityAttr

Main.hs:72-74:

cursorType <- clang_getCursorType cursor
let typeKind = cxtKind cursorType
traceU 1 "cursor type" =<< clang_getTypeSpelling cursorType

The documentation for clang_getCursorType() states:

Retrieve the type of a CXCursor (if any).

It says "if any" but does not provide details about the "Nothing" case. Perhaps the type kind is CXType_Invalid and I simply need to check for that before attempting to dump any information about the type. That would be trivial to fix.

src/HsBindgen/Clang/Core.hs:1054-1057:

clang_getTypeSpelling :: HasCallStack => CXType -> IO Text
clang_getTypeSpelling typ = ensure (not . Text.null) $
     onHaskellHeap typ $ \typ' ->
       preallocate_$ wrap_getTypeSpelling typ'

That is it! That ensure call fails if the returned Text is empty, and it constructs a CallFailed exception using that empty Text as the hint. That makes perfect sense. By design, CallFailed uses values to communicate errors, and the backtrace provides the context that is required to make sense of an error. Since clang_getTypeSpelling returns Text, I should have realized that the empty string is the error case for a non-null check. I do not think that any changes are needed.

If we would like to provide easier-to-understand errors in such cases, however, perhaps the easiest way would be to define an alternative version of ensure that allows the caller to specify the CallFailed hint.

ensure' :: (HasCallStack, Typeable e, Show e)
  => (a -> Bool)
  -> (a -> e)
  -> IO a
  -> IO a
ensure' = ensureOn' id

ensureOn' :: (HasCallStack, Typeable e, Show e)
  => (a -> b)
  -> (b -> Bool)
  -> (a -> e)
  -> IO a -> IO a
ensureOn' f p mkHint call = do
    x <- call
    if p (f x)
      then return x
      else callFailed (mkHint x)

Example usage:

data CallFailedHint = EmptyResult | NullPtrResult
  deriving Show

ensureNotEmpty :: (HasCallStack, Eq a, Monoid a) => IO a -> IO a
ensureNotEmpty = ensure' (/= mempty) (const EmptyResult)

ensureNotNull :: (HasCallStack, Coercible a (Ptr x)) => IO a -> IO a
ensureNotNull = ensure' (not . isNullPtr) (const NullPtrResult)

clang_getTypeSpelling :: HasCallStack => CXType -> IO Text
clang_getTypeSpelling typ = ensureNotEmpty $
     onHaskellHeap typ $ \typ' ->
       preallocate_$ wrap_getTypeSpelling typ'

TravisCardwell commented 1 month ago

As discussed in the call yesterday, I created a new issue for the clang-ast-dump utility: #212.

I rebased this morning, updating the code to work with the recent redesign of Fold.

Details are in the commit message:

Add clang-ast-dump package (#212)

This commit is a squashed version of a number of previous commits, rebased to work with a recent redesign of Fold.

FoldM uses MonadIO, and I refactored the code to call liftIO minimizing changes. Now that we can use a Reader, it is possible to track indentation within the Reader environment, using local for indentation, but I have not implemented that.

The new version of Fold no longer provides parent, as the parent cursor can be queried using the API. I rewrote the code to query both the semantic and lexical parents. When they are equal and are not the target file, just "parent" is displayed. Otherwise, "semantic parent" and "lexical parent" are displayed separately.

This program uses features added in the doxygen branch (#26), but I am referencing the new clang-ast-dump issue (#212).

TravisCardwell commented 1 month ago

The Clang documentation API provides access to Clang's parsed comments. The API only provides access to partially-parsed documentation, however. Notably, it does not include parsed Markdown. During the call yesterday, @edsko had the excellent idea of seeing if the HTML (clang_FullComment_getAsHTML) or XML (clang_FullComment_getAsXML) representations of comments take into account such complete parsing.

Tests:

clang_getSymbolGraphForUSR

Comment: ```c /** * Generate a single symbol symbol graph for the given USR. Returns a null * string if the associated symbol can not be found in the provided \c CXAPISet. * * The output contains the symbol graph as well as some additional information * about related symbols. * * \param usr is a string containing the USR of the symbol to generate the * symbol graph for. * * \param api the \c CXAPISet to look for the symbol in. * * \returns a string containing the serialized symbol graph representation for * the symbol being queried or a null string if it can not be found in the * APISet. */ ``` HTML, reformatted for GitHub comment: ```html

Generate a single symbol symbol graph for the given USR. Returns a null string if the associated symbol can not be found in the provided CXAPISet.

The output contains the symbol graph as well as some additional information about related symbols.

usr: is a string containing the USR of the symbol to generate the symbol graph for.
api: the CXAPISet to look for the symbol in.

Returns a string containing the serialized symbol graph representation for the symbol being queried or a null string if it can not be found in the APISet.

``` XML, reformatted for GitHub comment: ```xml clang_getSymbolGraphForUSR c:@F@clang_getSymbolGraphForUSR CXString clang_getSymbolGraphForUSR(const char *usr, CXAPISet api) Generate a single symbol symbol graph for the given USR. Returns a null string if the associated symbol can not be found in the provided CXAPISet. usr 0 in is a string containing the USR of the symbol to generate the symbol graph for. api 1 in the CXAPISet to look for the symbol in. a string containing the serialized symbol graph representation for the symbol being queried or a null string if it can not be found in the APISet. The output contains the symbol graph as well as some additional information about related symbols. ```

Markdown

Comment: ```c /** * A field preceded by a Doxygen comment * * This is a *simple* example using [Markdown][]. * * [Markdown]: * * > This is a block quote, *not* code. * * putStrLn("This is code.") * * * itemized * * list * * 1. enumerated * 2. list */ ``` HTML: ```html

A field preceded by a Doxygen comment

This is a *simple* example using [Markdown][].

[Markdown]: <https://en.wikipedia.org/wiki/Markdown>

> This is a block quote, *not* code.

putStrLn("This is code.")

* itemized * list

1. enumerated 2. list

``` XML: ```xml a c:@S@S4@FI@a char a A field preceded by a Doxygen comment This is a *simple* example using [Markdown][]. [Markdown]: <https://en.wikipedia.org/wiki/Markdown> > This is a block quote, *not* code. putStrLn("This is code.") * itemized * list 1. enumerated 2. list ```

Unfortunately, neither representation parse the Markdown. Formatting is lost, so Markdown cannot be properly parsed from these representations.

edsko commented 1 month ago

Yes, there are certainly libclang functions that have undefined behaviour (often even segfault) when passed certain arguments (invalid cursors, invalid types, ..). When we discover these, we should add adiditonal checks, like the one you already put in place (the discussion on ensureValidType on the result versus assertValidType on an argument).

TravisCardwell commented 1 month ago

I realized that the HTML/XML representations may be helpful in deciding what to do with the many special commands, so I went ahead and enabled them in clang-ast-dump.

I imagine that we will implement special support for common special commands (such as \param) and implement a generic catch-call for the rest. (I have not looked into it in detail yet, but there may be two catch-alls if there are both inline and block commands.) For example, some special commands would probably look decent if translated to Haddock definition lists.

edsko commented 1 month ago

So are you saying that the HTML/XML representation provides specific rendering for Doxygen commands, and that that detail is not available through the libclang Doxygen API?

TravisCardwell commented 1 month ago

So are you saying that the HTML/XML representation provides specific rendering for Doxygen commands, and that that detail is not available through the libclang Doxygen API?

The special commands are well documented. The HTML/XML from libclang just makes it easy to see how they deal with specific commands. Having that extra information makes our job easier, speeding things up.

We can (also) run Doxygen on test cases to see the resulting HTML. I installed Doxygen on my system for such investigation.

TravisCardwell commented 1 month ago

If we want to correctly parse Doxygen Markdown and translate it to Haddock syntax, we should probably use a library.

Pandoc is (IMHO) not a viable candidate for two reasons:

It is a heavy library. It can significantly increase build times and executable sizes.
It uses a GPL-2.0 license, which may be problematic for users.

Perhaps we can use commonmark, by the same author. It is lightweight in comparison, and it has a 3-Clause BSD license. We should first assess if CommonMark and Doxygen Markdown are suitably similar.

Would it be viable to add commonmark as a dependency of the package, however, taking the transitive dependencies into consideration as well?

edsko commented 1 month ago

Keeping a small dependency footprint is not unimportant, especially since this may run as a TH splice. However, I also don't think we should be overly worried about it. Most of the dependencies of commonmark we already have anyway.

TravisCardwell commented 1 month ago

I investigated how we might output Haddock comments in the preprocessor case. We generate the Haskell AST using haskell-src-exts, and we currently generate the Haskell source code using Pretty, which does not support comments.

Package haskell-src-exts-sc provides a way to include comments. It works as follows:

An AST annotated with Maybe CodeComment is pretty-printed using Pretty, which ignores the annotations.
The pretty-printed code is then parsed with Parser to get an AST that is annotated with source locations.
The two ASTs are combined to produce an AST that is annotated with both comments and source locations, using generics.
That AST is traversed, and source locations for comments are determined while the source locations for the AST elements are updated to make room for the inserted comments.
The resulting AST annotated with the updated source locations and list of Comments (which includes source locations) can then be printed using ExactPrint.

The package is not maintained. The version in Hackage fails to build (with GHC 9.6.6) because it does not load the UndecidableInstances extension, but that issue was fixed in the repository.

My test produced broken output, as a comment was inserted immediately after a brace, starting a multi-line comment.

-- | This is some module documentation.
module Demo where

-- | Foo is a data type
data Foo = Foo{-- | It has a bar!
               bar :: Int, -- | It has a baz!
                           baz :: Int}

The algorithm used for inserting comments is simple, and I think it would work pretty well if the non-commented code put each field (etc.) on a separate line (as they should be with comments). If we want to fix this, I think it would involve forking the Pretty implementation, not making any major changes to haskell-src-exts-sc. (The only change would be to use the forked pretty-printing implementation.) Both packages use the BSD-3 license.

Alternatives include using ghc-exactprint, but that would likely result in a higher maintenance cost.

At any rate, I now have a much better idea of how the translated documentation will be consumed.

TravisCardwell commented 1 month ago

Doxygen documentation may contain references to identifiers.

We translate from C names to Haskell names using a local context. For example, record field names translate to names that include the data type name or constructor name, according to configured options. When we run across an identifier in the documentation, we are not able to perform the same translation without context.

If we want to translate references, perhaps we can accumulate a reference map (from C names to Haskell names) during code translation and then pass that map to a subsequent step that translates the documentation. Any identifier that is in the reference map would then be able to be translated.

An easy yet unhelpful alternative is to not translate references. All references could be rendered as code (@reference@), resulting in Haskell documentation that still "references" the C identifiers.

TravisCardwell commented 1 month ago

I ran into what seemed to be strange behavior, and I finally realized that it is not strange but rather a mistake in the documentation in clang-c/Documentation.h (for libclang 18)!

/**
 * Convert a given full parsed comment to an HTML fragment.
 *
...
 * \li "para-brief" for \paragraph and equivalent commands;
...
 */
CINDEX_LINKAGE CXString clang_FullComment_getAsHTML(CXComment Comment);

The mistake is in the middle line, which should have \\paragraph with an escaped backslash to reference the \paragraph command instead of use it. The \paragraph documentation includes a warning:

This command only works inside a subsubsection of a related page documentation block and not in other documentation blocks!

It is in one of the "other documentation blocks" here. Not only does it serve as a no-op, it breaks the list item so that the following text is parsed as a verbatim line in between list items.

In documentation where an escaped backslash is used to reference a command, the escaped backslash and following text are represented in separate (adjacent) text elements.

EDIT

In clang-c/Index.h:

/**
 * @}
 */

/**
 * \defgroup CINDEX_MODULE Module introspection
 *
 * The functions in this group provide access to information about modules.
 *
 * @{
 */

The \defgroup command is used for organization, but it looks like libclang does not support it. As with the above \paragraph mistake, the command is not recognized. From the libclang AST, there is no reference to the command. We cannot know which command is there, though we can tell that something did not work because it breaks the flow of the documentation and (incorrectly) parses text after the problematic command as a verbatim line.

Such issues are the only places where I see verbatim lines in the parsed ASTs. I am not even sure how to create a verbatim line aside from this, and the documentation does not provide any clues.

TravisCardwell commented 1 month ago

I am switching back to other priorities, but I was really close to finishing the first implementation of the reifying of the libclang documentation AST and went ahead and did so this morning. Perhaps it is a good idea to go ahead and create a PR and merge it so that the branch does not get lost/forgotten/stale.

The "kind" of a CXComment determines what it contains, but the C type system of course cannot constrain return values by kind. For example, it is possible to get a CXComment of any kind even if a CXComment of a kind that represents block content is expected, from the perspective of the type system. I do not expect there to be issues with this, but I went ahead and implemented defensively just in case.

Haskell type Comment is a top-level comment, which corresponds to the CXComment_FullComment kind. If any other kind is found, it is put into an appropriate type so that we do not lose documentation. Note that Comment also includes the display name of the CXCursor, which we may use to document the original C name.

Kind CXComment_Null is not represented with a type in Haskell. The (high-level) clang_getComment function returns a Maybe Comment, where Nothing corresponds to a CXComment_Null CXComment.

The rest of the CXComment kinds are organized into block (CommentBlockContent) and inline (CommentInlineContent) content. When block content is expected but inline content is found, that inline content is wrapped in an appropriate block container so that it is not lost. When inline content is expected but a paragraph is found, the paragraph content is returned so that it is not lost. When inline content is expected but other block content is found, it is ignored. There is not an elegant way to handle this case, which should never happen anyway.

well-typed / hs-bindgen

Include Haddocks for exported (low-level) bindings #26