Avoid unnecessary copies when marshaling/unmarshaling OCaml objects from/to void *

denis631 commented 2 years ago

I am working on a hobby database project in OCaml and am currently adding WiredTiger C bindings to it (WiredTiger is a storage engine).

I stumbled upon the issue of marshaling/unmarshaling OCaml objects. The idea is to write plain OCaml objects to disk as an array of bytes and read the OCaml objects back directly from disk.

However, I don't know how to do it efficiently without performing unnecessary copies.

E.g., to write data to disk I currently take the OCaml object and marshal it into bytes (1st copy), then I need to map the OCaml bytes to a char CArray or something similar. However, coerce failed for me (casting an OCaml uchar pointer to a C uchar pointer), which is why I allocate a new CArray instance (2nd copy). I think no copy should be needed at all: I could write the current in-memory view of the object to disk (its address in memory and its length), which means 0 copies instead of 2.
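A minimal sketch of the write path described above, using only the standard library. It is an illustration, not the code from the linked repository: a char Bigarray stands in for the ctypes CArray (an assumption), and `to_buffer` is a hypothetical name.

```ocaml
(* Off-heap byte buffer that C code could read; stands in for a CArray. *)
type buf =
  (char, Bigarray.int8_unsigned_elt, Bigarray.c_layout) Bigarray.Array1.t

(* Write path with two copies:
   copy 1: Marshal.to_bytes serializes the value into fresh OCaml bytes;
   copy 2: the bytes are copied element by element into off-heap memory. *)
let to_buffer (v : 'a) : buf =
  let b = Marshal.to_bytes v [] in
  let n = Bytes.length b in
  let ba = Bigarray.Array1.create Bigarray.char Bigarray.c_layout n in
  for i = 0 to n - 1 do
    Bigarray.Array1.unsafe_set ba i (Bytes.unsafe_get b i)
  done;
  ba
```

The second copy is the one that disappears if the value is marshaled straight into memory that C can already address.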

When reading data from disk I need to map the void * data to an OCaml object. Unfortunately, I cannot create bytes from a void *, which is why I make a CArray instead. Luckily this operation doesn't involve any copies. However, I need to create bytes out of it, so I call Bytes.init size f (1st copy). Then, when unmarshaling the OCaml object from bytes, new memory is allocated (2nd copy). In this scenario one copy should suffice, by unmarshaling the OCaml object directly from the void *.
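The read path can be sketched the same way with only the standard library. Again this is an illustration under assumptions: a char Bigarray stands in for the CArray view over the void * data, and `of_buffer` is a hypothetical name.

```ocaml
(* Off-heap byte buffer, standing in for the CArray view of the void * data. *)
type buf =
  (char, Bigarray.int8_unsigned_elt, Bigarray.c_layout) Bigarray.Array1.t

(* Read path with two copies:
   copy 1: Bytes.init copies the off-heap data into fresh OCaml bytes;
   copy 2: Marshal.from_bytes allocates the reconstructed value.
   The result type is chosen by the caller and is not checked at runtime. *)
let of_buffer (ba : buf) : 'a =
  let n = Bigarray.Array1.dim ba in
  let b = Bytes.init n (fun i -> Bigarray.Array1.unsafe_get ba i) in
  Marshal.from_bytes b 0
```

Only the second allocation is inherent to unmarshaling; the intermediate bytes exist purely because Marshal.from_bytes expects OCaml bytes.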

How can I implement this without the unnecessary cloning of the data? Thank you very much in advance 🙏

PS: When inserting tuples I first marshal the OCaml object to bytes here: https://github.com/denis631/LegoDB/blob/6fd39397e61d51ca1a2115ce8f7dd7b2b5cd0666/src/storage/table.ml#L45 When I retrieve the data from disk I unmarshal it here: https://github.com/denis631/LegoDB/blob/6fd39397e61d51ca1a2115ce8f7dd7b2b5cd0666/src/storage/table.ml#L28

The conversions bytes -> WT_ITEM and WT_ITEM -> bytes can be found here: https://github.com/denis631/LegoDB/blob/6fd39397e61d51ca1a2115ce8f7dd7b2b5cd0666/src/storage/wired_tiger/wired_tiger.ml#L436-L449

WT_ITEM is a WiredTiger abstraction that represents an object written on disk: https://source.wiredtiger.com/2.9.3/group__wt.html#struct_w_t___i_t_e_m

fdopen commented 2 years ago

parmap contains a module (Bytearray) that allows marshaling directly into a bigarray (and back): https://github.com/rdicosmo/parmap/blob/db44dc9cf7a6af7b56d8ebda8c75be3375c89282/src/bytearray.mli#L42-L46

Ctypes provides helpers to get a pointer from the bigarray: https://github.com/ocamllabs/ocaml-ctypes/blob/acf2e352b8e36804b8b35d96e3962b894c5cd0e7/src/ctypes/ctypes.mli#L348-356
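A rough sketch of the pointer side of this suggestion, assuming the ctypes library is available. It uses `Ctypes.bigarray_start` and `Ctypes.bigarray_of_ptr`; the names `view_as_voidp` and `view_of_voidp` are hypothetical, and the bigarray is assumed to have been filled elsewhere, for example by marshaling into it.

```ocaml
(* Requires the ctypes library. *)
type buf =
  (char, Bigarray.int8_unsigned_elt, Bigarray.c_layout) Bigarray.Array1.t

(* Expose the bigarray's storage as a void pointer plus a length.
   No data is copied. Keep [ba] reachable while C code uses the pointer. *)
let view_as_voidp (ba : buf) : unit Ctypes.ptr * int =
  let p = Ctypes.bigarray_start Ctypes.array1 ba in
  (Ctypes.to_voidp p, Bigarray.Array1.dim ba)

(* View [len] bytes starting at [p] as a char bigarray.
   No data is copied, so the memory behind [p] must outlive the result. *)
let view_of_voidp (p : unit Ctypes.ptr) (len : int) : buf =
  Ctypes.bigarray_of_ptr Ctypes.array1 len Bigarray.char
    (Ctypes.from_voidp Ctypes.char p)
```

Because both functions only create views, a write through one view is visible through the other, which is what removes the extra copies on the way to and from C.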

denis631 commented 2 years ago

Thank you very much @fdopen! 🙏 By applying your suggestions I managed to more than halve the ingestion time of the tpcc_customer string data (83MB), from 6.4 to 2.9 seconds. Over 50% faster!

yallop / ocaml-ctypes

Avoid unnecessary copies when marshaling/unmarshaling OCaml objects from/to void * #710