yeslogic / fathom

🚧 (Alpha stage software) A declarative data definition language for formally specifying binary data formats. 🚧
Apache License 2.0
259 stars 14 forks source link

Ensure that type system is sound in the presense of arithmetic expressions #63

Closed brendanzab closed 4 years ago

brendanzab commented 6 years ago

At the moment our arithmetic expression syntax is very rich, combining the following:

I'm still a bit unclear as to how thos will interact with the other features of the language, like array sizes and pointer offsets. Do we want to think about overflow and overflow as well?

mikeday commented 6 years ago

A simple example is indexing an array with [i-1], does this imply a constraint that i > 0 that we check in advance, or do we just wait and to the bounds check just before indexing and abort if it fails.

brendanzab commented 6 years ago

It would be super nice if in the future we could cull lots of bounds checks at compile time. Our type system should be expressive enough to allow that, the tricky thing is making sure that we don't inadvertently throw safety out the door along the way.

brendanzab commented 6 years ago

Interesting references in the Puffs language's related work document.

Keywords mentioned in our meeting:

brendanzab commented 6 years ago

Reposting @markbrown's comment from email:

Here's some formal rules for bounds checking of integers. First, some preliminaries:

Expressions

e ::=  x                    Variable
    |  k                    Constant
    |  -e                   Negate
    |  e1 + e2              Plus
    |  e1 - e2              Minus
    |  e1 * e2              Times
    |  e1 / e2              Divide

Environments map each variable to a range

Γ ::=  .  |  x : l .. u, Γ     (l ≤ u)

The following judgement says that in the environment Γ, the expression e has a value between l and u inclusive.

Γ ⊢ e : l .. u

The Var rule looks up the range in the environment. The Const rule says that constant k has range k .. k.

x : l .. u  ∈  Γ
---------------- (Var)
Γ ⊢ x : l .. u

-------------- (Const)
Γ ⊢ k : k .. k

For negation, the bounds are negated and also swapped. For addition we simply add the corresponding bounds. Subtraction is just a combination of these two rules.

Γ ⊢ e : l .. u
----------------- (Neg)
Γ ⊢ -e : -u .. -l

Γ ⊢ e1 : l1 .. u1    Γ ⊢ e2 : l2 .. u2
-------------------------------------- (Plus)
  Γ ⊢ e1 + e2 : l1 + l2 .. u1 + u2

Γ ⊢ e1 : l1 .. u1    Γ ⊢ e2 : l2 .. u2
-------------------------------------- (Minus)
  Γ ⊢ e1 - e2 : l1 - u2 .. u1 - l2

This is enough to show that, if i has the range 1 .. 10 then i - 1 has the range 0 .. 9, and can safely be used as an array index without checking the lower bound.

So far it should be pretty easy to see that the above rules are correct*. Times is a bit harder to explain, but if you draw the X-Y plane and sketch out various rectangles representing ranges on X and Y, you should be able to see that the maximum and minimum values of the product are always at corner points. So the easiest way to calculate the bounds is to calculate the four products corresponding to the corners of the rectangle, the find the minimum and maximum of that set. Viz:

Γ ⊢ e1 : l1 .. u1    Γ ⊢ e2 : l2 .. u2
Σ = {l1 * l2, l1 * u2, u1 * l2, u1 * u2}
---------------------------------------- (Times)
     Γ ⊢ e1 * e2 : min Σ .. max Σ

Hopefully you get the idea, so I'm going to leave division as an exercise. Note that you need to carefully define its behaviour for negative inputs. Also note that you don't need to worry about the bound in the case that the divisor is zero, since in this case the program will abort anyway. So, for example, if X and Y are in the range 0 .. 10 then X / Y will be in the range 0 .. 10, not 0 .. +inf.

We also add a Sub rule so that an expression can be used where a larger range is expected, such as in an array index.

Γ ⊢ e : l .. u    l' ≤ l    u ≤ u'
---------------------------------- (Sub)
        Γ ⊢ e : l' .. u'

In the Sub rule the Curry-Howard isomorphism comes into play, in that we get bits of program to correspond with our bits of proof. The thing is, in the target language the smaller range might fit into a byte, whereas the larger range might need to be larger, or signed. So we need a cast in the generated code. Essentially the parts of the proof where we use Sub will correspond with parts of the program where we might need a cast.

In a similar vein, we could add one last rule for implicit coercions. This is like Sub, except that the range becomes smaller instead of larger. Also like Sub, it would correspond to something in the generated program, except in this case the code would be a bounds check, not a cast. (But like it was said in the meeting, maybe we should require explicit checks at first.)

Γ ⊢ e : l .. u    l ≤ l'    u' ≤ u
---------------------------------- (Coerce)
        Γ ⊢ e : l' .. u'

It's worth noting that the method is really very incomplete, particularly if negative numbers and multiplication are involved. But for our application area I'm quite confident about the effectiveness we'll get from it.

brendanzab commented 6 years ago

Relevant Rust RFC: rust-lang/rfcs#671 in it looking at Sage was recommended by @kmcallister.