well-typed / falsify

Other
38 stars 6 forks source link

Feature request: Don't generate unused values (a la LazySmallcheck2012) #70

Closed Warbo closed 3 months ago

Warbo commented 9 months ago

What is lazy generation

LazySmallcheck2012 (original, and my fork that works on newer GHCs) behaves in a way where data is only generated when the test forces its evaluation. (I just checked the original lazysmallcheck package, and it doesn't seem to behave this way.)

For example, consider this simple test in falsify:

prop_unneededElements :: Property ()
prop_unneededElements = do
  lst :: [Int] <- gen (Gen.list
                        (Range.between (0, 100))
                        (Gen.inRange (Range.between (0, 100))))
  if length lst > 3 && head lst < 10
     then testFailed "known failure, to check strictness"
     else return ()

This is asserting that every lst :: [Int] with more than three elements also has a first element that is at least 10. It fails (as expected):

Running 1 test suites...
Test suite falsify-test-test: RUNNING...
Playing
  Unneeded elements: FAIL (0.12s)
    failed after 28 successful tests and 5 shrinks
    known failure, to check strictness
    Logs for failed test run:
    generated [0,0,0,0] at CallStack (from HasCallStack):
      gen, called at test/Main.hs:18:19 in main:Main

    Use --falsify-replay=010985dabce2ff4eb5c78d1e70747b669b to replay.

1 out of 1 tests failed (0.13s)

Test suite falsify-test-test: FAIL

In this case falsify reports a minimal counterexample of [0, 0, 0, 0]. Whilst that's the smallest fully-normalised value, it is actually over-specified: the test never forces the last three elements, so they are irrelevant to the outcome (only the spine and the first element are relevant). If we compare this to the same check in LazySmallcheck2012 (in GHCi):

*LSC2012> test (\(lst :: [Int]) -> not (length lst > 3 && head lst < 10))
LSC: Depth 0:
LSC: Property holds after 3 tests.
LSC: Depth 1:
LSC: Property holds after 5 tests.
LSC: Depth 2:
LSC: Property holds after 7 tests.
LSC: Depth 3:
LSC: Property holds after 9 tests.
LSC: Depth 4:
LSC: Counterexample found after 12 tests.
Var 0: -3:_:_:_:[]
*** Exception: ExitFailure 1

It gives -3:_:_:_:[], AKA [-3, _, _, _]. Here _ indicates a value which was never forced, so never ran its generator, and is hence irrelevant to the cause of the test failure. Unfortunately in this case the first value (-3) isn't minimal, since (Lazy)Smallcheck doesn't do any shrinking; the ideal counterexample would be [0, _, _, _].

Why lazy generation

There are a few reasons this behaviour is nicer, as a user, than seeing fully normalised counterexamples:

There may be performance benefits, by avoiding a bunch of generators running; but I haven't found that particularly noticable.

How LazySmallcheck does it

As far as I know, LazySmallcheck2012 works by starting with all inputs as exceptions, and retrying a test with more refined inputs if any of those exceptions are thrown (until some depth limit is reached). In the above example it might have run the property on an input like throw A, which causes the exception A to be thrown; so it ran again with [] (which passed) and throw A : throw B (which throws B). The latter is refined to throw A : [] (which passed) and throw A : throw B : throw C (which throws C) and so on until throw A : throw B : throw C : throw D : [] finally satisfies the length check, and causes the head check to throw A; causing it to be retried with an input like -3 : throw A : throw B : throw C : [] which finally causes the test to fail, and that is our counterexample (with throw X shown as _).

How falsify might do it

Reading through https://well-typed.com/blog/2023/04/falsify/ it appears like falsify's sample trees could be used for lazy generation. From my understanding, each generator gets its own sub-tree; each node contains a PRNG sample which a generator can use directly; and calls to other generators can be given the child subtrees. It seems to me that we can tell which parts of the input were forced by a test, by keeping track of which nodes of the sample tree have been forced.

Unfortunately I can't think of an elegant, pure way to do this. We could wrap each node in a thunk which has the side-effect of inserting an ID in a set, then freeze a copy of the set after running the test, to see which nodes were forced. Ugly, but workable.

Once we know which parts of the same tree were used/unused, we need to map that over to actual counterexamples, which would require running the generator again. We'd need a pretty-printing mechanism that's similar to LazySmallcheck's, i.e. capable of inserting a _ in place of undefined values (e.g. by replacing the unused sample nodes with such printers, or exceptions, or something else).

I'm sure someone could think of a more elegant approach than this, but I at least think it's (a) desirable behaviour and (b) plausible to implement, given the way (I think) falsify works.

edsko commented 9 months ago

As you already hint at, falsify is very much designed with lazy generation in mind; that's how it can generate infinite data structures such as functions, for example. Only the part of the value that the test needs is actually generated.

I think a pure and idiomatic way to do what you sketch above is something like this:

data Value a = Defined a | Undefined

instance Show a => Show (Value a) where
  show (Defined x) = show x
  show Undefined = "_"

fromValue :: Value a -> a
fromValue (Defined x) = x
fromValue Undefined = undefined

fromValues :: [Value a] -> [a]
fromValues = map fromValue

Then write a regular generator that can shrink Defined values to Undefined; that way, if the function under test never looks at the value, it can definitely shrink to Undefined and you get the behaviour you want, I tihnk.

Let me know how that works out; if it does work and it's useful, a PR would be great! :)

Warbo commented 3 months ago

I've played around a bit, and indeed counterexamples are as lazy as possible; it's only the printing that forces values. For example, Interactive.shrink will return counterexamples containing undefined without issue; whilst the tasty driver triggers an exception since it tries to print them. Whilst falsify does a good job of generation and shrinking, I think the functionality I'm asking for is orthogonal (showing which parts of a value are unevaluated), and not really specific to a test framework. For example, ChasingBottoms provides something similar (though it only distinguishes bottom/non-bottom; so doesn't quite solve the problem), and both it and LazySmallcheck rely on custom SYB-style alternatives to Show. Another avenue to explore would be generalising counterexamples, like SmartCheck or extrapolate. Again, I think that's mostly orthogonal; or at least relies on more bespoke type classes (e.g. Express) so isn't appropriate to impose on users.