well-typed / cborg

Binary serialisation in the CBOR format
https://hackage.haskell.org/package/cborg
188 stars 86 forks source link

clarify need for encodeListLen #152

Open elaforge opened 6 years ago

elaforge commented 6 years ago

This is a question more than an issue, please let me know if there's a more appropriate place to ask. To make it an actionable issue maybe we could say it's a request to extend the docs to answer it.

In switching from cereal/binary It's looks like cborg adds an additional requirement, which is the encodeListLen, as documented in the tutorial. But it seems like this is just another thing to manually add, and keep up to date, and possible get wrong. I assume it's needed by the underlying CBOR encoding, but it seems like we can do it automatically by reusing the tuple instance. So instead of the tutorial recommended:

instance Serialise Lilypond.StaffConfig where
    encode (Lilypond.StaffConfig a b c d e) = encodeListLen 5 <> encodeTag 0
        <> encode a <> encode b <> encode c <> encode d <> encode e
    decode  = do
        len <- decodeListLen
        tag <- decodeTag
        case (len, tag) of
            (5, 0) -> do
                long :: Lilypond.Instrument <- decode
                short :: Lilypond.Instrument <- decode
                code :: [Text] <- decode
                display :: Bool <- decode
                add_bass :: Bool <- decode
                return $ Lilypond.StaffConfig long short code display add_bass
            _ -> Serialise.bad_version "Lilypond.StaffConfig" tag

I can write:

instance Serialise Lilypond.StaffConfig where
    encode (Lilypond.StaffConfig a b c d e) = encodeTag 0
        <> encode (a, b, c, d, e)
    decode  = do
        tag <- decodeTag
        case tag of
            0 -> do
                ( long :: Lilypond.Instrument
                    , short :: Lilypond.Instrument
                    , code :: [Text]
                    , display :: Bool
                    , add_bass :: Bool
                    ) <- decode
                return $ Lilypond.StaffConfig long short code display add_bass
            _ -> Serialise.bad_version "Lilypond.StaffConfig" tag

The output order will go [tag, length, ...] instead of the recommended [length, tag, ...], but the former seems to make more sense to me anyway, since surely the length will change depending on the tag. But perhaps CBOR requires the latter?

The decode has awkward indentation due to wanting to put type annotations on each of the fields, but in my experience I need type annotations on each field when I start to accumulate more versions.

So is this a reasonable way to do version-sensitive encoding and decoding? If so, is it better than the way recommended in the tutorial? Should I update the tutorial?

bgamari commented 6 years ago

I suppose you could do this. Is it better? I think that's a matter of aesthetics.

On a related note, you point out a glaring inaccuracy in the tutorial: the generic instances (correctly) represent constructor tags as CBOR words, not CBOR tags. The latter would be inappropriate here as they only apply to the CBOR term directly following and therefore would only "cover" the first field of a product. This should be fixed ASAP.

elaforge commented 6 years ago

I think it's not just aesthetics, because in the recommended pattern you still have that extra 5, which can go wrong in two ways: one is updated in one place but not the other, which will be caught by a round-trip test, and the other is consistent, but not the right number, which I think won't be caught by any test. You'll just be silently creating bad CBOR and might not notice until much later when a generic CBOR parser crashes on it (or maybe silently skips data, probably many decoders are not tested on invalid but well-formed data). And then added to that, re-using the tuple encoder even accidentally caught another bug in the example, which would have been another thing round-trip tests wouldn't catch, but you might find out much later when you try to parse it using another implementation. Also, from reading the tutorial and not knowing the details of CBOR, I had to go check the spec to see what is the required order of length and tag... I would have been tempted to do that even if the bug had not been present, because I always want to know when you have some order, what happens when the order is the other way?

I would also argue that anyone needing backward compatibility will be using the by-hand method, due to versions. That implies they are saving to permanent storage, which makes it all the more important to not have a buggy encoder. Also, from personal experience, these instances are the ones that are frequently updated by hand, but in a semi-mechanical way, which is a predictable source of bugs. My personal style with redundant type declarations evolved due to such bugs, and has continued to catch them.

So surely if style 2 eliminates at least 3 bug opportunities in addition to manual bookkeeping, and is syntactically shorter and (presumably) just as efficient, shouldn't it be preferred? Especially for a tutorial which is what everyone is going to copy paste into their project before they really understand what's going on.

And then... I started updating the tutorial text to fix the list len, tag order problem, and noticed that it looks like the encodeListLens are actually all off by one: encodeTag 0 <> encodeListLen 3 <> encode name <> encode height. Shouldn't that be 2? If I encode (name, height) I get 2. If 2 is in fact correct, then this is exactly the bug I was worrying about... and presumably written by someone who knows all about CBOR, not the confused beginner reading it.

dcoutts commented 6 years ago

Yes it really must be

encodeListLen 3 <> encodeWord 0 <> encode name <> encode height

and not

encodeTag 0 <> encodeListLen 3 {- or 2 -} <> encode name <> encode height

The list bit is required for CBOR encoding, it's encoding the tuple (0, $name, $height) which has 3 components. Using the work tag here is confusing, since CBOR has a notion of tag which is different. Let me call it the constructor number. The constructor number is necessary for multi-constructor types and highly recommended for single-constructor types because it allows for backwards compatibility.

Why not re-use the tuple instance and do encodeWord 0 <> encode (name, height)? Well because that's not well-formed CBOR (you can't just have two items adjacent, they need to be in a CBOR list or map). It'd have to be encode (0, name, height). Why not recommend this? Well I don't think it generalises so nicely, and it's also probably a bit confusing itself.

axman6 commented 5 years ago

I don't understand what's confusing about recommending people use

... = encode (0, fieldA1, fieldA2)
... = encode (1, fieldB1, fieldB2, fieldB3)

or even having something like

encodeConstr :: EncodableTuple a => Int -> a -> Encoding

Which only has instances for tuples so users can write:

encode (A x y)   = encodeConstr 0 (x,y)
encode (B a b c) = encodeConstr 1 (a,b,c)

which clearly calls out which constructor maps to which constructor tag.

It stops the user from ever getting the encodeListLen wrong, it's an obvious translation of product types which reduces one place for problems (it doesn't help in the decoder, but making the encoder harder to get wrong will make writing the decoders simpler as you'll know they are 'correct').

Personally I feel the current way to write encoders is far too error prone, but it does have the advantage of allowing the user to be pretty precise about the encodings used; the above would rely on using the default encoding for each of the members of the tuples, which in my experience with Aeson can cause a lot of problems.