edsko opened 3 months ago
I defined the wrappers in `doxygen_wrappers.h`, as well as the corresponding Haskell API in `HsBindgen.Clang.Doxygen` and the necessary enumerations in `HsBindgen.Clang.Doxygen.Enums`, in the `doxygen` branch.
Of the content in `clang-c/Documentation.h`, the following is not (yet) implemented:

* `CXComment` attributes `ASTNode` and `TranslationUnit` are not implemented. I think that we will traverse the AST top-down, so I doubt we will need these.
* `CXAPISet`, as well as the related functions `clang_createAPISet` and `clang_disposeAPISet`, are not implemented. I do not think that we will need these for our purposes.
* `clang_getSymbolGraphForUSR` and `clang_getSymbolGraphForCursor` are not implemented. I do not think that we will need these for our purposes.

Please feel free to let me know if we will need something that I have not implemented, though I will no doubt find out when I proceed with implementation.
I am currently implementing an `ast-dump` utility that walks the AST and dumps information so that I can see exactly how the AST is structured. The LLVM tool works fine, of course; I want to see details via our library.

Just to get started quickly, I started implementing it as a `hs-bindgen-libclang` test suite, like the tutorial. It could be useful in the long run, though. Perhaps it should be put in a separate package, so that more dependencies can be used. For example, it would probably be worthwhile to implement a CLI parser using `optparse-applicative` to provide options that select what information is displayed. I am being very verbose with documentation output since that is what I am working on, but such information would get in the way when working on something else.
BTW, output is currently formatted in Markdown.
I have not committed the code yet, but I should be able to do so tomorrow.
I wonder if it is worth adding `HsBindgen.Clang.Util.Classification` predicates for the documentation API.
Would the existing `show-clang-ast` command help with this? Or do you need more info than it provides?
> Would the existing `show-clang-ast` command help with this? Or do you need more info than it provides?
I hadn't seen that yet! Nice!
Currently, I am inspecting the parsed comments in detail.
I finished the initial version of the Clang AST dump utility, now named `clang-ast-dump`. I put it in a package of the same name, in the `doxygen` branch. This package is (tentatively) temporary, just used for current development. It can be enabled by creating a symbolic link, as described in the README.
The purpose of the utility is to dump (detailed) information about the Clang AST. The existing `show-clang-ast` command traverses the Haskell types that we create from the Clang data structures, while this utility dumps information from the Clang data structures themselves. Elements are not output flat; the dumped information is displayed hierarchically, in Markdown format.
Various options configure what is dumped. Notably, `--same-file` can be used to dump only things from the specified file, not things included by that file. Other boolean options configure which types of details are displayed.

Currently, the utility only recurses on the same (four) cursor kinds as `HsBindgen.C.Parser.foldDecls`. Other cursor kinds output `CURSOR_KIND_NOT_IMPLEMENTED`; if we see that in the output of something that we are investigating, we know more information may be available.
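As a rough sketch of this sentinel idea (the constructor and function names below are illustrative stand-ins, not the actual hs-bindgen-libclang types), dispatching on cursor kind might look like:

```haskell
-- Hypothetical sketch of dispatching on cursor kind; the names below are
-- illustrative, not the actual hs-bindgen-libclang types.
data CursorKind
  = StructDecl | EnumDecl | TypedefDecl | FieldDecl
  | OtherKind String  -- any kind we do not recurse on yet
  deriving (Eq, Show)

-- Produce a one-line summary, with a visible sentinel for unhandled kinds
-- so that missing information is easy to spot in the output.
dumpCursorKind :: CursorKind -> String
dumpCursorKind k = case k of
  StructDecl  -> "struct declaration"
  EnumDecl    -> "enum declaration"
  TypedefDecl -> "typedef declaration"
  FieldDecl   -> "field declaration"
  OtherKind s -> "CURSOR_KIND_NOT_IMPLEMENTED (" ++ s ++ ")"
```

Scanning the dump for the sentinel then tells us exactly which kinds still have unexplored information.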
The comment dumping is complete, since that is what I am working on.
I am now using this utility to investigate how real-world Doxygen comments are parsed by Clang. I will build up a plan for how to translate these to Haddock comments. I am starting with `clang-c/Index.h` and `clang-c/Documentation.h`.
Let's discuss this in our call today.
During the call yesterday, I mentioned that I saw some unhelpful (empty) failures when experimenting with recursing on some more cursor kinds. The failure is not an issue at this time, as we do not support those cursor kinds yet, but we might want to change how we raise the exception so that it is more informative.
I remember that `CXCursor_FunctionDecl` caused this issue. Here is the error, which has an empty hint:
```
clang-ast-dump: CallFailed "" CallStack (from HasCallStack):
  collectBacktrace, called at src/HsBindgen/Clang/Internal/Results.hs:41:14 in hs-bindgen-libclang-0.1.0-inplace:HsBindgen.Clang.Internal.Results
  callFailed, called at src/HsBindgen/Clang/Internal/Results.hs:68:12 in hs-bindgen-libclang-0.1.0-inplace:HsBindgen.Clang.Internal.Results
  ensureOn, called at src/HsBindgen/Clang/Internal/Results.hs:54:10 in hs-bindgen-libclang-0.1.0-inplace:HsBindgen.Clang.Internal.Results
  ensure, called at src/HsBindgen/Clang/Core.hs:1055:29 in hs-bindgen-libclang-0.1.0-inplace:HsBindgen.Clang.Core
  clang_getTypeSpelling, called at Main.hs:74:32 in clang-ast-dump-0.0.0.0-inplace-clang-ast-dump:Main
```
I could link to the code, but those links will very quickly be invalid when I rebase this morning, so I will copy-and-paste.
The above failure occurred when recursing on the `clang_Cursor_getParsedComment(CXCursor)` function of `clang-c/Documentation.h`. The cursor is for the `default` attribute, with cursor kind `simpleEnum CXCursor_VisibilityAttr`.
```
* "default"
* parent: "clang_Cursor_getParsedComment(CXCursor)"
* extent: ("/usr/include/clang-c/Documentation.h",(47,1),(47,15))
* cursor kind: simpleEnum CXCursor_VisibilityAttr
```
`Main.hs:72-74`:

```haskell
cursorType <- clang_getCursorType cursor
let typeKind = cxtKind cursorType
traceU 1 "cursor type" =<< clang_getTypeSpelling cursorType
```
The documentation for `clang_getCursorType()` states:

> Retrieve the type of a `CXCursor` (if any).
It says "if any" but does not provide details about the "`Nothing`" case. Perhaps the type kind is `CXType_Invalid`, and I simply need to check for that before attempting to dump any information about the type. That would be trivial to fix.
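A minimal sketch of that check, with stand-in types (the real `CXType`, `CXTypeKind`, and `cxtKind` live in `hs-bindgen-libclang`); the idea is simply to produce no type information when the kind is invalid:

```haskell
-- Stand-in types for illustration; the real CXType/CXTypeKind come from
-- hs-bindgen-libclang, and CXTypeKind has many more constructors.
data CXTypeKind = CXType_Invalid | CXType_Int | CXType_Pointer
  deriving (Eq, Show)

newtype CXType = CXType { cxtKind :: CXTypeKind }

-- Guard against CXType_Invalid before dumping any type details, so that
-- we never call functions like clang_getTypeSpelling on an invalid type.
typeInfo :: CXType -> Maybe String
typeInfo ty
  | cxtKind ty == CXType_Invalid = Nothing
  | otherwise = Just ("cursor type kind: " ++ show (cxtKind ty))
```

The dumping code would then simply skip the type section when `typeInfo` returns `Nothing`.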
`src/HsBindgen/Clang/Core.hs:1054-1057`:

```haskell
clang_getTypeSpelling :: HasCallStack => CXType -> IO Text
clang_getTypeSpelling typ = ensure (not . Text.null) $
    onHaskellHeap typ $ \typ' ->
      preallocate_ $ wrap_getTypeSpelling typ'
```
That is it! That `ensure` call fails if the returned `Text` is empty, and it constructs a `CallFailed` exception using that empty `Text` as the hint. That makes perfect sense. By design, `CallFailed` uses values to communicate errors, and the backtrace provides the context that is required to make sense of an error. Since `clang_getTypeSpelling` returns `Text`, I should have realized that the empty string is the error case for a non-null check. I do not think that any changes are needed.
If we would like to provide easier-to-understand errors in such cases, however, perhaps the easiest way would be to define an alternative version of `ensure` that allows the caller to specify the `CallFailed` hint:
```haskell
ensure' :: (HasCallStack, Typeable e, Show e)
  => (a -> Bool)
  -> (a -> e)
  -> IO a
  -> IO a
ensure' = ensureOn' id

ensureOn' :: (HasCallStack, Typeable e, Show e)
  => (a -> b)
  -> (b -> Bool)
  -> (a -> e)
  -> IO a
  -> IO a
ensureOn' f p mkHint call = do
    x <- call
    if p (f x)
      then return x
      else callFailed (mkHint x)
```
Example usage:

```haskell
data CallFailedHint = EmptyResult | NullPtrResult
  deriving Show

ensureNotEmpty :: (HasCallStack, Eq a, Monoid a) => IO a -> IO a
ensureNotEmpty = ensure' (/= mempty) (const EmptyResult)

ensureNotNull :: (HasCallStack, Coercible a (Ptr x)) => IO a -> IO a
ensureNotNull = ensure' (not . isNullPtr) (const NullPtrResult)

clang_getTypeSpelling :: HasCallStack => CXType -> IO Text
clang_getTypeSpelling typ = ensureNotEmpty $
    onHaskellHeap typ $ \typ' ->
      preallocate_ $ wrap_getTypeSpelling typ'
```
As discussed in the call yesterday, I created a new issue for the `clang-ast-dump` utility: #212.
I rebased this morning, updating the code to work with the recent redesign of `Fold`.
Details are in the commit message:

> Add clang-ast-dump package (#212)
>
> This commit is a squashed version of a number of previous commits, rebased to work with a recent redesign of `Fold`.
>
> * `FoldM` uses `MonadIO`, and I refactored the code to call `liftIO`, minimizing changes. Now that we can use a `Reader`, it is possible to track indentation within the `Reader` environment, using `local` for indentation, but I have not implemented that.
> * The new version of `Fold` no longer provides `parent`, as the parent cursor can be queried using the API. I rewrote the code to query both the semantic and lexical parents. When they are equal and are not the target file, just "parent" is displayed. Otherwise, "semantic parent" and "lexical parent" are displayed separately.
> * This program uses features added in the `doxygen` branch (#26), but I am referencing the new `clang-ast-dump` issue (#212).
The Clang documentation API provides access to Clang's parsed comments. The API only provides access to partially-parsed documentation, however. Notably, it does not include parsed Markdown. During the call yesterday, @edsko had the excellent idea of seeing if the HTML (`clang_FullComment_getAsHTML`) or XML (`clang_FullComment_getAsXML`) representations of comments take such complete parsing into account.
Tests:
A field preceded by a Doxygen comment:

```
This is a *simple* example using [Markdown][].

[Markdown]: <https://en.wikipedia.org/wiki/Markdown>

> This is a block quote, *not* code.

    putStrLn("This is code.")

* itemized
* list

1. enumerated
2. list
```

Unfortunately, neither representation parses the Markdown. Formatting is lost, so Markdown cannot be properly parsed from these representations.
Yes, there are certainly `libclang` functions that have undefined behaviour (often even segfaulting) when passed certain arguments (invalid cursors, invalid types, ...). When we discover these, we should add additional checks, like the one you already put in place (the discussion on `ensureValidType` on the result versus `assertValidType` on an argument).
I realized that the HTML/XML representations may be helpful in deciding what to do with the many special commands, so I went ahead and enabled them in `clang-ast-dump`.
I imagine that we will implement special support for common special commands (such as `\param`) and implement a generic catch-all for the rest. (I have not looked into it in detail yet, but there may be two catch-alls if there are both inline and block commands.) For example, some special commands would probably look decent if translated to Haddock definition lists.
So are you saying that the HTML/XML representation provides specific rendering for Doxygen commands, and that that detail is not available through the `libclang` Doxygen API?
> So are you saying that the HTML/XML representation provides specific rendering for Doxygen commands, and that that detail is not available through the `libclang` Doxygen API?
The special commands are well documented. The HTML/XML from `libclang` just makes it easy to see how they deal with specific commands. Having that extra information makes our job easier, speeding things up.
We can (also) run Doxygen on test cases to see the resulting HTML. I installed Doxygen on my system for such investigation.
If we want to correctly parse Doxygen Markdown and translate it to Haddock syntax, we should probably use a library.
Pandoc is (IMHO) not a viable candidate for two reasons:
Perhaps we can use `commonmark`, by the same author. It is lightweight in comparison, and it has a 3-Clause BSD license. We should first assess whether CommonMark and Doxygen Markdown are suitably similar.
Would it be viable, however, to add `commonmark` as a dependency of the package, taking the transitive dependencies into consideration as well?
Keeping a small dependency footprint is not unimportant, especially since this may run as a TH splice. However, I also don't think we should be overly worried about it. Most of the dependencies of `commonmark` we already have anyway.
I investigated how we might output Haddock comments in the preprocessor case. We generate the Haskell AST using `haskell-src-exts`, and we currently generate the Haskell source code using `Pretty`, which does not support comments.
Package `haskell-src-exts-sc` provides a way to include comments. It works as follows:

1. The AST, annotated with `Maybe CodeComment`, is pretty-printed using `Pretty`, which ignores the annotations.
2. The output is parsed using `Parser` to get an AST that is annotated with source locations.
3. The AST with `Comment`s (which include source locations) can then be printed using `ExactPrint`.

The package is not maintained. The version on Hackage fails to build (with GHC 9.6.6) because it does not enable the `UndecidableInstances` extension, but that issue was fixed in the repository.
My test produced broken output, as a comment was inserted immediately after a brace, starting a multi-line comment:

```haskell
-- | This is some module documentation.
module Demo where

-- | Foo is a data type
data Foo = Foo{-- | It has a bar!
bar :: Int, -- | It has a baz!
baz :: Int}
```
The algorithm used for inserting comments is simple, and I think it would work pretty well if the non-commented code put each field (etc.) on a separate line (as they should be with comments). If we want to fix this, I think it would involve forking the `Pretty` implementation, not making any major changes to `haskell-src-exts-sc`. (The only change would be to use the forked pretty-printing implementation.) Both packages use the BSD-3 license.
Alternatives include using `ghc-exactprint`, but that would likely result in a higher maintenance cost.
At any rate, I now have a much better idea of how the translated documentation will be consumed.
Doxygen documentation may contain references to identifiers.
We translate from C names to Haskell names using a local context. For example, record field names translate to names that include the data type name or constructor name, according to configured options. When we run across an identifier in the documentation, we are not able to perform the same translation without context.
If we want to translate references, perhaps we can accumulate a reference map (from C names to Haskell names) during code translation and then pass that map to a subsequent step that translates the documentation. Any identifier that is in the reference map would then be able to be translated.
An easy yet unhelpful alternative is to not translate references. All references could be rendered as code (`@reference@`), resulting in Haskell documentation that still "references" the C identifiers.
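The reference-map idea could be sketched as follows. This is a hypothetical illustration using `Data.Map`; `RefMap` and `renderRef` are made-up names, and the Haddock quoting is simplified:

```haskell
import qualified Data.Map.Strict as Map

-- Map from C identifiers to the Haskell names generated for them,
-- accumulated during code translation (hypothetical sketch).
type RefMap = Map.Map String String

-- Translate a documentation reference: link to the Haskell name when the
-- identifier is in the map; otherwise fall back to rendering the C
-- identifier as code (@reference@), as described above.
renderRef :: RefMap -> String -> String
renderRef refs cname =
  case Map.lookup cname refs of
    Just hsName -> "'" ++ hsName ++ "'"  -- Haddock identifier link
    Nothing     -> "@" ++ cname ++ "@"   -- code rendering fallback
```

With this split, the documentation-translation pass needs no knowledge of the name-mangling options; it only consumes the accumulated map.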
I ran into what seemed to be strange behavior, and I finally realized that it is not strange but rather a mistake in the documentation in `clang-c/Documentation.h` (for `libclang` 18)!
```c
/**
 * Convert a given full parsed comment to an HTML fragment.
 *
...
 * \li "para-brief" for \paragraph and equivalent commands;
...
 */
CINDEX_LINKAGE CXString clang_FullComment_getAsHTML(CXComment Comment);
```
The mistake is in the middle line, which should have `\\paragraph` with an escaped backslash to reference the `\paragraph` command instead of using it. The `\paragraph` documentation includes a warning:

> This command only works inside a subsubsection of a related page documentation block and not in other documentation blocks!
Here, it is in one of the "other documentation blocks". Not only is it a no-op; it breaks the list item so that the following text is parsed as a verbatim line in between list items.
In documentation where an escaped backslash is used to reference a command, the escaped backslash and following text are represented in separate (adjacent) text elements.
EDIT
In `clang-c/Index.h`:
```c
/**
 * @}
 */

/**
 * \defgroup CINDEX_MODULE Module introspection
 *
 * The functions in this group provide access to information about modules.
 *
 * @{
 */
```
The `\defgroup` command is used for organization, but it looks like `libclang` does not support it. As with the above `\paragraph` mistake, the command is not recognized. From the `libclang` AST, there is no reference to the command. We cannot know which command is there, though we can tell that something did not work, because it breaks the flow of the documentation and (incorrectly) parses the text after the problematic command as a verbatim line.
Such issues are the only places where I see verbatim lines in the parsed ASTs. I am not even sure how to create a verbatim line aside from this, and the documentation does not provide any clues.
I am switching back to other priorities, but I was really close to finishing the first implementation of the reification of the `libclang` documentation AST, so I went ahead and finished it this morning. Perhaps it is a good idea to create a PR and merge it so that the branch does not get lost/forgotten/stale.
The "kind" of a `CXComment` determines what it contains, but the C type system of course cannot constrain return values by kind. For example, from the perspective of the type system, it is possible to get a `CXComment` of any kind even where a `CXComment` of a kind that represents block content is expected. I do not expect there to be issues with this, but I went ahead and implemented it defensively just in case.
Haskell type `Comment` is a top-level comment, which corresponds to the `CXComment_FullComment` kind. If any other kind is found, it is put into an appropriate type so that we do not lose documentation. Note that `Comment` also includes the display name of the `CXCursor`, which we may use to document the original C name.
Kind `CXComment_Null` is not represented with a type in Haskell. The (high-level) `clang_getComment` function returns a `Maybe Comment`, where `Nothing` corresponds to a `CXComment_Null` `CXComment`.
The rest of the `CXComment` kinds are organized into block (`CommentBlockContent`) and inline (`CommentInlineContent`) content. When block content is expected but inline content is found, that inline content is wrapped in an appropriate block container so that it is not lost. When inline content is expected but a paragraph is found, the paragraph content is returned so that it is not lost. When inline content is expected but other block content is found, it is ignored. There is not an elegant way to handle this case, which should never happen anyway.
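As a rough sketch of that defensive wrapping (with simplified, hypothetical constructors rather than the much richer types in the branch):

```haskell
-- Simplified, hypothetical versions of the inline/block content types;
-- the real types in the doxygen branch carry much more information.
data InlineContent
  = TextContent String
  | InlineCommand String
  deriving (Eq, Show)

data BlockContent
  = Paragraph [InlineContent]
  | VerbatimLine String
  deriving (Eq, Show)

-- Block content expected: wrap stray inline content in a paragraph so
-- that it is not lost.
expectBlock :: Either InlineContent BlockContent -> BlockContent
expectBlock (Left inline) = Paragraph [inline]
expectBlock (Right block) = block

-- Inline content expected: unwrap a paragraph so its content is not
-- lost; ignore other block content (a case that should never happen).
expectInline :: Either InlineContent BlockContent -> [InlineContent]
expectInline (Left inline)            = [inline]
expectInline (Right (Paragraph inls)) = inls
expectInline (Right _)                = []
```

The point is that each "wrong kind" case degrades gracefully: documentation is preserved wherever a sensible container exists for it.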
When we generate foreign imports from C headers, if the declarations in those headers have documentation attached, then we should include that documentation in the generated bindings also. Ideally we would be using Haddock formatting here, but given the absence of a general agreed standard for the documentation of C headers, that might be hard to do in general. Nonetheless, we could at least support common formats.