translate-c: ability to annotate types in C code for better translation

ziglang / zig

General-purpose programming language and toolchain for maintaining robust, optimal, and reusable software.

https://ziglang.org

MIT License

34.6k stars 2.53k forks source link

translate-c: ability to annotate types in C code for better translation #2457

Open andrewrk opened 5 years ago

andrewrk commented 5 years ago

If upstream C sources are willing to add extra annotations to their header files, Zig can perform better automatic translation. Here are some examples of things that could be annotated:

Pointer types could be annotated as non-null
Pointer types could be annotated as single-item, unknown-length, or null-terminated (#265)

GCC/Clang have the nonnull attribute which can be applied to function parameters, but not types in general. Certainly Zig should respect this attribute on function parameters, but there is also room for coming up with a way to annotate types as well.

kristoff-it commented 5 years ago

If we agree that the central point of this issue is making the best use possible of header files, then I can say that we have two fronts that we can attack the issue from:

1. Changing the header file Which means that the C project we want to integrate with wants to collaborate with us.

2. Adding knobs to @cInclude() This would allow us to tweak from Zig the result of importing an header file.

I suspect a full solution requires both, here's my reasoning:

We have to deal with 2 problems that in my opinion require different solutions: A. We want to make some the arguments of some functions non-optional B. We want to unpack some variables (more likely, variables containing function pointers)

I can see A being solved via nonnull or some similar annotation in the header file, while it seems to me it would be much more unwieldy to try and alter function signatures after the fact (please tell me if I'm wrong, I can only reason about this from the PoV of what the usage could look like from Zig).

B, instead, seems solvable from Zig with something like this:

const includeOpts = IncludeOpts {
   .makeNonOptional = [][] const u8{"RedisModule_*"},
};
const redis = @cImport({
    @cInclude("./redismodule.h", includeOpts);
});

The advantages of something like this would be the following:

You don't need cooperation from the C project to do it (while still making use of the header file)
In debug and/or release safe mode, we can add a runtime check to the values that we are trying to unpack, and return a meaningful error when a failure happens.

Being able to specify by pattern matching which symbols you want to unpack (as I do in my example) would make this feature very easy to use, hardly any boilerplate.

I imagine a similar argument could be made for single-item, unknown-length and null-terminated pointers (both for casting variables and re-interpreting function arguments).

So it would be interesting to see if it is possible to provide the same choice to A (removing optionals from function arguments), but I don't know enough to say anything useful about it. Likewise, maybe it would be worth to come up with a symbol that the C project can add in front of variables that they want to mark as non-null once the header file has finished executing.

The idea of offering a cooperative and a non-cooperative option seems intriguing.

kristoff-it commented 5 years ago

After watching yesterday's stream (great timing on that btw) and thinking about it a bit more I have a couple more thoughts to share. After listening to your explanation and reading some of the self-hosted code, I'm starting to get the hang of how translate-c works. I now understand that the process goes like this: c_code -(clang)-> clang_ast -> zig_ast -> zig_code.

I was originally imagining adding some kind of macro to the c header file but I see now that that's not possible, unless what we add is legit C syntax. I was originally naively thinking of some weird ifdef macro that would add an extra attribute to function arguments, but that would have been awful, and very likely to be rejected by any C maintainer.

nonnull seems good for function signatures, it does exactly what we want it to in terms of nullability but doesn't help with the other pointer issues, nor with variables. I also have to say that I don't like the syntax: a list of arg indexes seems a bit error prone to me)

Here's an alternative idea: what if we add these annotations as comments?

//zig: (non-null, non-null to-many, ...)
typedef int (*RedisModuleCmdFunc)(RedisModuleCtx *ctx, RedisModuleString **argv, int argc);

If we find a way to make the syntax reasonably human-readable, we could also consider dropping any Zig reference, in order to make the comment just a harmless hint for people reading the headerfile.

More complete example:

//pointer-hints: (non-null, (non-null to-many) -> non-null, non-pointer)
typedef int (*RedisModuleCmdFunc)(RedisModuleCtx *ctx, RedisModuleString **argv, int argc);

//var-usage-hint: RedisModule_* non-null
#define REDISMODULE_API_FUNC(x) (*x)
int REDISMODULE_API_FUNC(RedisModule_ReplyWithSimpleString)(RedisModuleCtx *ctx, const char *msg);

pointer-hints seems nice to me, but I'm not so sure about var-usage-hint. Another reason why I think pointer-hints is good, is because you only need it for callbacks, as you don't have to type extra code for calling a function that accepts optionals.

var-usage-int instead is either repeated for each var of type pointer, or you do pattern matching. Repeating seems intrusive, and pattern matching seems too obscure for people to understand when reading the comment.

One last thing: I noticed that I made a very wrong assertion in the last comment. I stated something along the lines that variables will be initialized once the headerfile is executed, but that's not true at all in redismodule.h, since we are calling RedisModule_Init from our module code. Could this have an impact on the feasibility of altering the behavior of @cInclude()?

Sahnvour commented 5 years ago

I like the "knobs" idea proposed by @kristoff-it, but understand it might be complex offering this configurability right into the compiler. But there's no problem which can't be solved by adding extra layers of indirection, and maybe comptime can help here to keep it as userland as possible ?

Anyways, I think it's highly unlikely that upstream C projects would agree on annotating their code for consumption by another language's bindings. We would be better off working without expecting write access to the source.

daurnimator commented 5 years ago

Anyways, I think it's highly unlikely that upstream C projects would agree on annotating their code for consumption by another language's bindings.

For many cases it's more than that e.g. gcc's nonnull attribute is useful for C consumers too.

But there's an additional dimension: when zig is used to generate .h files: we should be able to map back to perfect zig type signatures if that .h file is imported with @cImport.

Sahnvour commented 5 years ago

For many cases it's more than that e.g. gcc's nonnull attribute is useful for C consumers too.

True, but unfortunately there might not be an attribute directly mapping to the zig concept we need to annotate. And even then, I'm not convinced maintainers would agree to add compiler specific attributes (right or wrong). Also, some libraries could just not be practically modifiable without doing a fork, for various reasons.

The safe choice is not depending on upstream for this, and a pure zig solution might even allow "annotation" packets to be distributed with the associated pure-C library, streamlined inside zig's packet manager.

andrewrk commented 5 years ago

Good points all around.

I want to make note of another use case - which is when you want to use translate-c to port C code to Zig code, and you intend to drop the C code and start maintaining the Zig code. In such a situation translate-c is a one-time process after which the generated Zig code becomes source code maintained by hand. For this use case, one would be willing to do all manner of Zig-specific annotations in the C code, just to make the generated Zig code more readable.

kristoff-it commented 5 years ago

Maybe I can offer some more input after writing some code that made use of this functionality, although my experience is limited to Redis Modules.

In my use-case I encountered 2 issues:

The header file has a function to gather all the main API functions and save them in different variables. Unfortunately this means that those variables will all be optional pointers (i.e. requiring unpacking before use, e.g. RedisModule_Foo.?()).
Callbacks (so functions that you have to define) have annoying signatures, making every argument of type pointer an optional, and in my case those pointers are never going to be null.

After some first-hand experience, these feel to me the most natural ways of dealing with those problems:

1.

1 Should be solved on the Zig side. In my case (but I assume it's the same for all libraries), those variables are going to be null until you call RedisModule_Init() so any trickery at the header file level would introduce undesirable problems (e.g. if you try to call one of those functions before running init). I think the right solution for this would be to automatize as much as possible the unpacking.

const c_utils = @import("std").c_utils;
const redis = @cImport({
    @cInclude("./redismodule.h");
});
const redis_api = comptime c_utils.CTranslatePrepareUnpacking(redis, "RedisModule_*");

// Entry point
export fn RedisModule_OnLoad(...) c_int {
   // call the procedure that gathers function pointers
   redis.RedisModule_Init.?(...);

   // unpack functions once and for all
   c_utils.CTranslateUnpack(redis_api, redis)
}

In this example CTranslatePrepareUnpacking would declare a struct identical to redis but with the matching fields made non-optional. Then, after we initiate the API, CTranslateUnpack will effectively copy all pointers from one struct to the other, failing if one of the pointers turns out to be null. This is basically what you would otherwise do by hand I think.

There are probably better ways to do design the API for this. I just wanted to make explicit the fact that it has to be a 2 step process (declare a type at comptime, unpack pointers at runtime when the user is ready to do that).

With support for @reify this seems trivial to implement in userspace. If we decide to follow this route, I would be happy to contribute to the implementation :)

2.

2 Seems very problematic to solve from Zig. It's about redefining types and/or signatures that use such types. In my case, since Redis modules have basically a single type of callback, the shortest path was to edit manually the line where the type is defined, which in turn cascaded automatically to all other function that took a pointer of that type as an argument. I think this should be solved by somehow patching the C code, and if there is no elegant way to express this in the code, then just make it easy to write by hand. This step is anyway possible only if you have intimate knowledge of the C code you're dealing with.

One good property for our type-redefining annotations would be to be position-independent. This way we could transparently write them by hand (ideally in the right place, like the examples in my previous comments) or just blindly inject them in the beginning of the header file via a @cImport option, without having to know which precise line contains the definition.

With 1 and 2 at our disposal it would be easy to both write a little bit of boilerplate when importing the .h file directly, and/or create a "wrapper" that adds value without being particularly hard to maintain. Having the heavy lifting done by Zig, the wrapper could focus on getting the idiom right.

not-fl3 commented 5 years ago

I want to mention one of use cases of custom C types annotation: default zeroes for "designated initialization". Some C libraries heavily use the fact that C structs are sometime zero-initialized. It would be nice if c-translate will add defaults to each struct field.

Probably the best example: sokol-gfx library, https://github.com/ziglang/zig/issues/1031:

https://github.com/floooh/sokol-samples/blob/master/glfw/triangle-glfw.c#L40

/* prepare a resource binding struct, only the vertex buffer is needed here */
sg_bindings bind = {
    .vertex_buffers[0] = vbuf
};

sokol heavily relies on designated initializers. sg_bindings have a lot of fields, but the library expects that user will specify only one of them.

The other example is static arrays:

libwebsockets usage C code:

static struct lws_protocols protocols[] = {
    { "http", lws_callback_http_dummy, 0, 0 },
    LWS_PLUGIN_PROTOCOL_MINIMAL,
    { NULL, NULL, 0, 0 } /* terminator */
};

Which would be nice to use the same in zig like:

const server_protocols = [_]lws_protocols {
    lws_protocols {
        .name = c"http-only";
        .callback = callback_http;
        .per_session_data_size = @sizeOf(usize);
        // here should be other 4 fields, previously zero initialized by default
    },
    ....
}

Right now its possible to make a zig "wrapper" library, mirroring each generated C structs, but with defaults. But its a lot of work and will require a lot of pointer casting to use that structs in c functions. I think that that @cInclude annotation (or any other solution) will make a lot of C libs usage significantly more convenient without additional zig wrappers.

daurnimator commented 5 years ago

Right now its possible to make a zig "wrapper" library, mirroring each generated C structs, but with defaults. But its a lot of work and will require a lot of pointer casting to use that structs in c functions.

Is this a good @reify use-case? #383

not-fl3 commented 5 years ago

@daurnimator I made an example illustrating this idea: https://github.com/not-fl3/sokol-zig You can see that zig version is very close to the original version.

To make it so close - I hacked on translate-c for a bit it created default value for each struct field.

pub const struct_sg_draw_state = extern struct {
    _start_canary: u32 = @import("std").mem.zeroInit(u32),
    pipeline: sg_pipeline = @imsg_imageport("std").mem.zeroInit(sg_pipeline),
..
    fs_images: [12] = @import("std").mem.zeroInit([12]sg_image),
    _end_canary: u32 = @import("std").mem.zeroInit(u32),
};

And this hack worked surprisingly well - all the sokol examples now can be translated to zig in the very same style without any wrappers, just @cInclude.

I do understand that this should not be default behaviour, and that zeroInit functions should not be placed in std.mem. But I think - its an example of one of the possible results of @cInclude annotation or reifiyng each C type.

My goal here - to deomnstrate one of the C libraries that would be super nice to have and be easy to use from zig and find how I could help to make it happen.

daurnimator commented 4 years ago

A thought:

If we translate the snippet

extern const char foo[];

Then I think we should assume it is a [*] rather than using a c pointer?

kristoff-it commented 4 years ago

An update after gaining more experience in working with Zig, and using translate-c to write Redis modules.

I'll focus my reasoning more on usage patterns and less on the technical details. I'll also use the Redis module use case as my main example because that's the one real-world case that I have experience in, which means what I say might not apply just as well to other uses for translate-c but at least it's clear where I'm coming from.

As a user interested in writing a Redis module in Zig, importing redismodule.h directly should probably be the preferred starting point because this way I can read the original documentation and apply it 1:1 to my Zig code. Function names will be the same and signatures will be mapped in a consistent way. Assuming translate-c is bug-free, I'll also know that if something doesn't work it's most probably going to be my code.

As a Redis expert, I might be interested in maintaining a wrapper around redismodule.h for Zig developers. To make it worth for the developer to use my wrapper (vs just importing the headerfile directly), I would need to offer a good improvement in ergonomics. In my case for example I'd want to:

Trim function names to remove the RedisModule_ prefix because in Zig we have proper namespacing.
Make pointers (that I know should never be null) non-optional, be it pointers provided by the system or pointers provided by the user
Implement a Zig allocator interface for the Redis module allocator
Make some methods more idiomatic (eg, return zig errors)

At this point users will have to read my documentation to understand the changes that I made from the original API so that they know how to "translate" what they read in the original documentation (eg, function names are the same except for the RedisModule_ prefix that was removed). Whenever something breaks they will have to question if the breakage is imputable to my wrapper (or their understanding of it) vs the logic in their code, but the improved ergonomics should make up for it.

From this premise, here's a point that might be worth discussing:

End users vs wrapper maintainers

Maybe we shouldn't consider the end user personalizing the way Zig imports a C header file as the primary use case. In my previous comment, when I was sketching how the API might look like, I imagined that the end user would be writing that code, but now I think this shouldn't be the case. Having end-users the main focus of this API would imply:

That the API would need to not only be simple, but also easy (like in my previous example), which would limit the way it could interact with the import system
There would be a lot of duplication of effort and inconsistency, as everybody would do it their own way and they probably wouldn't bother improving API substantially, basically making the "extra complexity" vs "better ergonomics" trade off less worthwhile.

I think we should consider wrapper maintainers the main user of any such API, which means two things:

The API can be more complex in order give more control to the maintainer over how the headerfile is imported (e.g., it could allow to manipulate the C AST or the Zig AST during the translation process)
The API should be geared towards allowing the maintainers to do the minimal amount of work to update the wrapper when a new version of the C headerfile gets released. For example, instead of allowing users to manipulate any AST, we could add a few custom annotations to the headerfile and expect git to perform a sensible merge of the new headerfile version, with the occasional touch up from the maintainer.

Finally, focusing on "wrapper maintaners" also means that we can go nuts with C annotations and magic comments because we don't have to worry anymore about trying to convince upstream to accept changes, or having end users rummage through the headerfile by themselves.

tadeokondrak commented 4 years ago

Clang has _Nullable, _Nonnull, and _Null_unspecified qualifiers for any pointer type, mostly used by Objective-C/Swift, but are available for C headers too. It would be a nice small improvement to translate-c, but doesn't really work since C pointers are allowzero. I think C pointers shouldn't be allowzero by default, and translate-c should just output ?[*c] for them.

Here's the small bit of glue code to get the attribute if anyone wants to work on it: https://gist.github.com/tadeokondrak/514d0486d04a3b74c0e7383966505b1c The information is only available in the .Attributed arm of the transType switch, so it will require some reorganization there.

SpexGuy commented 4 years ago

I've made two large Zig -> C bindings now, for Vulkan and Dear ImGui (via cimgui), so I'd like to share the things that I'm considering when preparing a binding in order to better inform this feature.

First, I'd prefer not to modify the header file directly. The ability to update the C library in-place and know that I didn't accidentally break anything is really important to me, so I'd prefer to specify any metadata for the generator externally.

Second, excluding old and very stable apis, generation isn't only run once. It's run every time the underlying C library updates. So editing the .zig file by hand is a no-go for large apis, because you would need to specify every pointer type again for the entire api when updating. Instead you need a set of rules that the binding generator can use to generate pointer types. These rules will be specific to the api that you are binding, but should be specified in a way that is relatively resilient to updates. For Vulkan this is easy, they provide an xml file that describes nullability and array-ness. For Dear ImGui I made a list of rules that match against hierarchical context information. This way when the library updates I'll be able to update and keep most of the pointer types. Anything that changed in the API I'll be able to inspect in a diff, and add more rules if any incorrect choices were made or if anything was missed.

I also usually provide a second-tier api that demangles names, wraps pointer+length into slices, error codes into errors, and flag enums into packed structs of bools, along with occasional custom wrappers to make certain operations easier. These need even more codebase-specific information than the first-tier api, but for large apis that update frequently generation is still easier than writing everything by hand. The ability to override the generation of specific functions and supply your own implementation is important for some cases, especially when wrapping functions that use *c_void.

Maybe generating the second-tier api is out of scope for translate-c. But if translate-c could output an easy-to-parse metadata file containing all of the definitions (or a library ready to import it), it would make custom code to generate the second-tier api much easier to write.

kitoran commented 2 years ago

Would be cool if I could annotate enum as having no aliased enumerators, and such enum would be translated to a zig enum and not a bunch of integer constants