well-typed / hs-bindgen

Automatically generate Haskell bindings from C header files

Allow choice between platform independent and platform dependent API #168

Open edsko opened 6 days ago

edsko commented 6 days ago

Explained in this comment of PrimType:

-- | Primitive type
--
-- The interpretation of the primitive types is in many cases implementation
-- and/or machine dependent. In @hs-bindgen@ we are dealing with one specific
-- implementation (@libclang@), and we are generating code for one specific
-- machine (possibly cross platform). This means we have a choice; suppose we
-- see a field of type @int@ in a C struct:
--
-- 1. We could produce a field of type 'CInt' in the generated Haskell code
-- 2. We could query @libclang@ for the choice it makes for the selected
--    target platform, and use 'CShort' or 'CLong' (or something else again).
--
-- Both options have advantages; most users will probably prefer (1), so that
-- we generate a /single/ API, independent of implementation details. However,
-- some users may prefer (2) in some cases, if they want to take advantage of
-- specific features of the target platform.
--
-- We don't force the decision here, but simply represent the C AST faithfully.
data PrimType =

We are defaulting to option (1) currently, but as the comment says, we may wish to give users a choice.

phadej commented 5 days ago

We are not really implementing (1). We hard code offsets and alignment for structs for example. For (1) one would really need to generate hsc2hs code or "formulas" for size and alignment.

I don't think that (2) even makes sense for TH generation, TH is run on the target architecture anyway, so doing something else feels unnecessary. TH could assume host = target, but I guess true (1) would be better at least from testing perspective.


2. We could query @libclang@ for the choice it makes for the selected target platform, and use 'CShort' or 'CLong' (or something else again).

This doesn't feel right. I think that if we do (2) and libclang says "unsigned 4 byte integer" then we should use Word32 and not try to figure out which C-type on the host is the same size as the target type. In other words, be as explicit as possible.


For (1) we also need https://github.com/well-typed/hs-bindgen/issues/134, if the header has struct foo { uint64_t bar };, the field is uint64_t on all machines, though libclang will tell us some "primitive" c-type (at least after we look through typedefs). Similarly, for something like uint_fast32_t which is actually quite tricky to represent otherwise than as uint_fast32_t. (Foreign.C.Types doesn't have analogues for these, so those are a challenge for FFI already because of that).

#131 shows that just setting the target triple is not enough; there are some differences in the standard headers.

edsko commented 5 days ago

We are not really implementing (1). We hard code offsets and alignment for structs for example. For (1) one would really need to generate hsc2hs code or "formulas" for size and alignment.

Yes, that's why I emphasize "API" (as opposed to implementation). Perhaps I should be clearer about that.

My main reason for thinking this is OK is that if we say CInt, then the type itself carries the same ambiguity: you don't know what its size is. Picking one specific implementation (in terms of Storable instances, for example) is therefore compatible with that: it's "implementation defined" after all. This way the types stay the same and only the implementation differs. I think this is what most users would anyway expect?
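As a small illustration of that ambiguity, the following sketch prints the size and alignment that the Storable instances of CInt and CLong report on whatever platform it is compiled for. The names cIntLayout and cLongLayout are made up for this example; the numbers printed depend on the platform, which is exactly the point.

```haskell
import Foreign.C.Types (CInt, CLong)
import Foreign.Storable (alignment, sizeOf)

-- | (size, alignment) in bytes, as reported by the Storable instance on the
-- platform this program is compiled for. The types do not fix these values.
cIntLayout, cLongLayout :: (Int, Int)
cIntLayout  = (sizeOf (0 :: CInt),  alignment (0 :: CInt))
cLongLayout = (sizeOf (0 :: CLong), alignment (0 :: CLong))

main :: IO ()
main = do
  putStrLn ("CInt  (size, alignment): " ++ show cIntLayout)
  putStrLn ("CLong (size, alignment): " ++ show cLongLayout)
```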

I don't think that (2) even makes sense for TH generation, TH is run on the target architecture anyway

"TH is run on the target architecture anyway" -- that would be nice, but that is not actually the case with current ghc, is it?

This doesn't feel right. I think that if we do (2) and libclang says "unsigned 4 byte integer" then we should use Word32 and not try to figure out which C-type on the host is the same size as the target type. In other words, be as explicit as possible.

Yes, that's fair enough: if we do go with (2), then indeed we should be explicit.

For (1) we also need https://github.com/well-typed/hs-bindgen/issues/134, (..)

Yes, we should definitely not always look through typedefs; I'm currently working on that.

phadej commented 5 days ago

think this is what most users would anyway expect?

I don't understand, so are you saying that

data StructFoo = MkStructFoo
  { field1 :: CInt
  , field2 :: CInt  -- ^ system independent types
  }

instance Storable StructFoo where
  sizeOf _ = 64  -- 32 bit system specific value
  ...

is fine?

IMO it isn't. Either it's

(1)

data StructFoo = MkStructFoo
  { field1 :: CInt
  , field2 :: CInt
  }

instance Storable StructFoo where
  -- or some formula, like sizeof_ @CInt + sizeof_ @CInt,
  -- but also taking alignment into account
  sizeOf _ = #{size struct foo}

or

(2)

data StructFoo = MkStructFoo
  { field1 :: Word32
  , field2 :: Word32
  }

instance Storable StructFoo where
  sizeOf _ = 64

that would be nice, but not actually the case with current ghc is it?

It is. TH is always run on the target, one way or another (even with GHCJS etc). The current multi-staged setup won't work otherwise. (TL;DR: Int is the same on all stages; there is no host Int and target Int.) (EDIT: Maybe there are some cross-compilation workarounds in use where people run TH code on the host, pretending it's on the target, but that can easily cause problems. I can share examples privately.)

edsko commented 5 days ago

Why isn't it fine? CInt is defined to be implementation defined; here, there is one such implementation. I don't see a conflict here.

Option (1), generating essentially .hsc code, is explicitly not what the client wants: hs-bindgen should do the resolution, and it should not depend on invoking a C compiler. We could maybe offer this as a choice, but it would strictly be an enhancement that we choose to implement.

I don't understand re TH, but yes, we don't need to discuss this in this ticket.

phadej commented 5 days ago

The

data StructFoo = MkStructFoo
  { field1 :: CInt
  , field2 :: CInt  -- ^ system independent types
  }

instance Storable StructFoo where
  sizeOf _ = 64  -- 32 bit system specific value
  ...

is still system specific. Then there is really no difference between (1) and (2) if (1) means doing the above.

I couldn't do

hs-bindgen alib.h > ALib.hs

and commit that file to the repository (i.e. run hs-bindgen before code distribution). ALib.hs will be system specific.


I think I understand: the interface of the module will appear to be system-independent in (1), but the implementation will always be specific to the target, whether it's (1) or (2).

In other words hs-bindgen will not generate system independent code, it's out of scope?
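For comparison, the "formula" idea from earlier in the thread (deriving size and offsets from the field types instead of hard-coding numbers) might look roughly like the sketch below. It is only an illustration, not what hs-bindgen generates: alignUp, field1Offset and field2Offset are hypothetical names, and the layout rule used (each field at the next multiple of its alignment, total size rounded up to the struct alignment) is an assumption that matches common C ABIs but is not guaranteed for every target or for packed structs.

```haskell
import Foreign.C.Types (CInt)
import Foreign.Marshal.Alloc (alloca)
import Foreign.Storable (Storable (..))

-- | Round an offset up to the next multiple of the given alignment.
alignUp :: Int -> Int -> Int
alignUp off a = ((off + a - 1) `div` a) * a

data StructFoo = MkStructFoo
  { field1 :: CInt
  , field2 :: CInt
  } deriving (Eq, Show)

-- Offsets are computed from the Storable instance of the field type, so the
-- same source gives the right numbers on whichever platform compiles it.
field1Offset, field2Offset :: Int
field1Offset = 0
field2Offset = alignUp (field1Offset + sizeOf (0 :: CInt)) (alignment (0 :: CInt))

instance Storable StructFoo where
  -- Both fields are CInt, so the struct alignment is that of CInt.
  alignment _ = alignment (0 :: CInt)
  -- End of the last field, rounded up to the struct alignment.
  sizeOf x = alignUp (field2Offset + sizeOf (0 :: CInt)) (alignment x)
  peek p = MkStructFoo <$> peekByteOff p field1Offset <*> peekByteOff p field2Offset
  poke p (MkStructFoo a b) = pokeByteOff p field1Offset a >> pokeByteOff p field2Offset b

main :: IO ()
main = do
  let foo = MkStructFoo 1 2
  -- Round-trip through memory to check that peek and poke agree.
  foo' <- alloca $ \p -> poke p foo >> peek p
  print (sizeOf foo, foo' == foo)
```

On a platform where CInt is 4 bytes with 4-byte alignment, the formula gives a size of 8 bytes. A file like this could be committed to a repository, at the cost of relying on the layout assumption above instead of asking the C compiler.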

edsko commented 5 days ago

Yeah, I take your point. I think this needs some discussion with the client.

edsko commented 5 days ago

So I guess we have three modes:

  1. System independent API, system dependent implementation. The code that we generate (such as Storable instances) is specific to one platform (we know the size of CInt), but the API is system independent (we use CInt rather than Word64).
    • We would have to do "last minute code generation" (we couldn't just add a single set of generated bindings to a package intended to be cross-platform), but
    • when programmer A writes code against this API on one machine, that code should ideally still work when programmer B uses it on another machine (because the types don't allow machine specific choices).
  2. System dependent API, system dependent implementation. Unlike in (1), we also generate a machine specific API (Word64 instead of CInt).
    • This is primarily useful when writing code that uses a C library on a very specific device, for example when writing embedded code for a known platform.
  3. System independent API, system independent implementation (this probably involves generating .hsc instead of .hs files).

The mode that I had somewhat implicitly assumed we'd focus on first is (1), but indeed all three are valid. I thought from previous discussions with the client that (3) was not considered too desirable, but perhaps we should revisit this question. One difficulty with (3) is CPP: If the header uses CPP to make machine dependent choices, then it becomes unclear what we should do; in a way, option (3) implies #72, at least to some degree.
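To make the three modes a little more concrete, here is a rough sketch of how a mode setting could drive the choice of type in the generated API. Everything in it is hypothetical (Mode, TargetInt, apiType, hardCodesLayout are invented for illustration and are not hs-bindgen's actual types); it assumes the frontend can report signedness and size in bytes for the target, and it follows the suggestion above that mode (2) should use explicit fixed-width types such as Word32.

```haskell
-- | The three modes from the list above (constructor names are hypothetical).
data Mode
  = IndependentApiDependentImpl    -- (1)
  | DependentApiDependentImpl      -- (2)
  | IndependentApiIndependentImpl  -- (3)
  deriving (Eq, Show)

data Signedness = Signed | Unsigned
  deriving (Eq, Show)

-- | What we assume is known about a C integer type on the target.
data TargetInt = TargetInt
  { cTypeName  :: String      -- ^ Foreign.C.Types name, e.g. "CInt"
  , signedness :: Signedness
  , sizeBytes  :: Int         -- ^ size on the selected target
  } deriving (Eq, Show)

-- | Name of the Haskell type to use in the generated API.
-- Returns Nothing when mode (2) has no fixed-width type of that size.
apiType :: Mode -> TargetInt -> Maybe String
apiType DependentApiDependentImpl t = fixedWidth (signedness t) (sizeBytes t)
apiType _                         t = Just (cTypeName t)

-- | Fixed-width type name (Data.Int / Data.Word) for a size in bytes.
fixedWidth :: Signedness -> Int -> Maybe String
fixedWidth s n
  | n `elem` [1, 2, 4, 8] = Just (prefix ++ show (8 * n))
  | otherwise             = Nothing
  where
    prefix = case s of
      Signed   -> "Int"
      Unsigned -> "Word"

-- | Whether the generated instances bake in target-specific numbers.
hardCodesLayout :: Mode -> Bool
hardCodesLayout IndependentApiIndependentImpl = False
hardCodesLayout _                             = True

main :: IO ()
main = do
  let cuint = TargetInt "CUInt" Unsigned 4
  mapM_ (\m -> print (m, apiType m cuint, hardCodesLayout m))
        [IndependentApiDependentImpl, DependentApiDependentImpl, IndependentApiIndependentImpl]
```

In this sketch modes (1) and (3) differ only in hardCodesLayout, which mirrors the point that they share an API and differ in implementation.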