Closed — Warbo closed this issue 3 months ago
As you already hint at, `falsify` is very much designed with lazy generation in mind; that's how it can generate infinite data structures such as functions, for example. Only the part of the value that the test needs is actually generated.
I think a pure and idiomatic way to do what you sketch above is something like this:
```haskell
data Value a = Defined a | Undefined

instance Show a => Show (Value a) where
  show (Defined x) = show x
  show Undefined   = "_"

fromValue :: Value a -> a
fromValue (Defined x) = x
fromValue Undefined   = undefined

fromValues :: [Value a] -> [a]
fromValues = map fromValue
```
Then write a regular generator that can shrink `Defined` values to `Undefined`; that way, if the function under test never looks at the value, it can definitely shrink to `Undefined` and you get the behaviour you want, I think.
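As a quick sanity check of the idea, here is hypothetical usage of the `Value` type (the definitions are repeated so the snippet is self-contained): printing renders unforced parts as `_`, and `fromValues` stays lazy enough that defined parts can be consumed without touching the undefined ones.

```haskell
data Value a = Defined a | Undefined

instance Show a => Show (Value a) where
  show (Defined x) = show x
  show Undefined   = "_"

fromValue :: Value a -> a
fromValue (Defined x) = x
fromValue Undefined   = undefined

fromValues :: [Value a] -> [a]
fromValues = map fromValue

main :: IO ()
main = do
  print [Defined 1, Undefined, Defined 3 :: Value Int]  -- Undefined shows as "_"
  print (head (fromValues [Defined 1, Undefined]))      -- the Undefined is never forced
```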
Let me know how that works out; if it does work and it's useful, a PR would be great! :)
I've played around a bit, and indeed counterexamples are as lazy as possible; it's only the printing that forces values. For example, `Interactive.shrink` will return counterexamples containing `undefined` without issue; whilst the tasty driver triggers an exception, since it tries to print them.
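A small illustration of that "only printing forces values" behaviour (plain GHC, not falsify-specific): the spine and first element of a partially-defined list can be consumed freely, but `show` forces everything.

```haskell
import Control.Exception (SomeException, evaluate, try)

main :: IO ()
main = do
  let counterexample = [0, undefined, undefined, undefined] :: [Int]
  print (length counterexample)  -- forces the spine only: fine
  print (head counterexample)    -- forces the first element only: fine
  -- Printing the whole value forces every element, hitting the bottoms:
  r <- try (evaluate (length (show counterexample)))
  putStrLn (either (const "printing forced a bottom")
                   (const "printed fine")
                   (r :: Either SomeException Int))
```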
Whilst `falsify` does a good job of generation and shrinking, I think the functionality I'm asking for is orthogonal (showing which parts of a value are unevaluated), and not really specific to a test framework. For example, ChasingBottoms provides something similar (though it only distinguishes bottom/non-bottom, so doesn't quite solve the problem), and both it and LazySmallcheck rely on custom SYB-style alternatives to `Show`.
Another avenue to explore would be generalising counterexamples, like SmartCheck or extrapolate. Again, I think that's mostly orthogonal; or at least relies on more bespoke type classes (e.g. `Express`), so isn't appropriate to impose on users.
## What is lazy generation
LazySmallcheck2012 (original, and my fork that works on newer GHCs) behaves in a way where data is only generated when the test forces its evaluation. (I just checked the original lazysmallcheck package, and it doesn't seem to behave this way.)
For example, consider this simple test in falsify:
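(The original snippet hasn't survived the formatting here; as a plain-Haskell sketch, the property amounts to the predicate below. `prop` is a hypothetical name, and falsify's real API wraps this in generators and a `Property` monad.)

```haskell
-- Every list with more than three elements should have a head of at least 10.
prop :: [Int] -> Bool
prop lst = not (length lst > 3) || head lst >= 10

main :: IO ()
main = do
  print (prop [0, 0, 0])     -- too short to trigger the check
  print (prop [0, 0, 0, 0])  -- the reported counterexample
  -- Only the spine and first element matter; the rest can be bottom:
  print (prop (0 : undefined : undefined : undefined : []))
```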
This is asserting that every `lst :: [Int]` with more than three elements also has a first element that is at least 10. It fails (as expected): in this case falsify reports a minimal counterexample of `[0, 0, 0, 0]`. Whilst that's the smallest fully-normalised value, it is actually over-specified: the test never forces the last three elements, so they are irrelevant to the outcome (only the spine and the first element are relevant).

If we compare this to the same check in LazySmallcheck2012 (in GHCi), it gives `-3:_:_:_:[]`, AKA `[-3, _, _, _]`. Here `_` indicates a value which was never forced, so never ran its generator, and is hence irrelevant to the cause of the test failure. Unfortunately in this case the first value (`-3`) isn't minimal, since (Lazy)Smallcheck doesn't do any shrinking; the ideal counterexample would be `[0, _, _, _]`.

## Why lazy generation
There are a few reasons this behaviour is nicer, as a user, than seeing fully normalised counterexamples:

- Minimal values, like `0`, are ambiguous: sometimes it means they can be ignored, but sometimes there is significance to that value. Using `_` for the former removes some ambiguity.
- An unexpected `_` indicates a problem with our control flow.
- Seeing `_` instead of a particular value avoids accidental coincidences that we can waste time ruling out, e.g. `[0,0,0,0]` contains duplicate elements, but that's just a coincidence and isn't related to the failure.
- Small types like `Bool` can already be proved by exhaustion; lazy generators complement this, since a property which passes without forcing some part of its input must therefore be true for all values of that part.
- There may be performance benefits, by avoiding a bunch of generators running; but I haven't found that particularly noticeable.
## How LazySmallcheck does it
As far as I know, LazySmallcheck2012 works by starting with all inputs as exceptions, and retrying a test with more refined inputs if any of those exceptions are thrown (until some depth limit is reached). In the above example it might have run the property on an input like `throw A`, which causes the exception `A` to be thrown; so it ran again with `[]` (which passed) and `throw A : throw B` (which throws `B`). The latter is refined to `throw A : []` (which passed) and `throw A : throw B : throw C` (which throws `C`), and so on until `throw A : throw B : throw C : throw D : []` finally satisfies the `length` check and causes the `head` check to throw `A`; causing it to be retried with an input like `-3 : throw A : throw B : throw C : []`, which finally causes the test to fail, and that is our counterexample (with `throw X` shown as `_`).

## How falsify might do it
Reading through https://well-typed.com/blog/2023/04/falsify/ it appears that falsify's sample trees could be used for lazy generation. From my understanding, each generator gets its own sub-tree; each node contains a PRNG sample which a generator can use directly; and calls to other generators can be given the child subtrees. It seems to me that we can tell which parts of the input were forced by a test by keeping track of which nodes of the sample tree have been forced.
Unfortunately I can't think of an elegant, pure way to do this. We could wrap each node in a thunk which has the side-effect of inserting an ID in a set, then freeze a copy of the set after running the test, to see which nodes were forced. Ugly, but workable.
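That "thunk with a side effect" idea can be sketched with `unsafePerformIO`; everything below is hypothetical illustration, not falsify internals. Running the example property from earlier over tracked values records that only element 0 was ever forced.

```haskell
import Data.IORef (IORef, modifyIORef', newIORef, readIORef)
import qualified Data.Set as Set
import System.IO.Unsafe (unsafePerformIO)

-- Global set of IDs whose thunks have been forced (sketch only; a real
-- implementation would thread this through the test driver).
forcedIds :: IORef (Set.Set Int)
forcedIds = unsafePerformIO (newIORef Set.empty)
{-# NOINLINE forcedIds #-}

-- Wrap a value so that forcing it records its ID as a side effect.
tracked :: Int -> a -> a
tracked i x = unsafePerformIO (modifyIORef' forcedIds (Set.insert i)) `seq` x

main :: IO ()
main = do
  let input = [ tracked i v | (i, v) <- zip [0 ..] [5, 6, 7, 8] ]
  -- The example property forces the spine and the first element only:
  print (not (length input > 3) || head input >= 10)
  forced <- readIORef forcedIds
  print (Set.toList forced)  -- which elements the property demanded
```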
Once we know which parts of the sample tree were used/unused, we need to map that over to actual counterexamples, which would require running the generator again. We'd need a pretty-printing mechanism that's similar to LazySmallcheck's, i.e. capable of inserting a `_` in place of undefined values (e.g. by replacing the unused sample nodes with such printers, or exceptions, or something else).

I'm sure someone could think of a more elegant approach than this, but I at least think it's (a) desirable behaviour and (b) plausible to implement, given the way (I think) falsify works.
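For the printing piece, here's a minimal sketch of a `_`-aware list printer, similar in spirit to ChasingBottoms (the name `showLazyList` is hypothetical): each element's rendering is fully evaluated, falling back to `_` whenever that evaluation hits a bottom.

```haskell
import Control.Exception (SomeException, evaluate, try)
import Data.List (intercalate)

-- Render a list, showing bottoms (undefined / thrown exceptions) as "_".
showLazyList :: Show a => [a] -> IO String
showLazyList xs = do
  rendered <- mapM showElem xs
  pure ("[" ++ intercalate ", " rendered ++ "]")
  where
    showElem x = do
      -- Force the whole rendered string, so bottoms inside x surface here:
      r <- try (evaluate (let s = show x in length s `seq` s))
      pure (case r :: Either SomeException String of
              Left _  -> "_"
              Right s -> s)

main :: IO ()
main = putStrLn =<< showLazyList ([0, undefined, undefined, undefined] :: [Int])
```

This recovers the "ideal counterexample" form discussed above, `[0, _, _, _]`, from a value whose last three elements were never defined.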