whatyouhide / stream_data

Data generation and property-based testing for Elixir. 🔮
https://hexdocs.pm/stream_data
872 stars 66 forks source link

Collect and print test data statistics #42

Open alfert opened 7 years ago

alfert commented 7 years ago

In QuickCheck and PropEr, you can add functions to a property which collect statistical data about the generated data or add labels for classification of generated data. These information are printed on the screen after executing a property. They help to asses the quality of the generator and are useful to analyse whether relevant test cases/branches are executed with the generated data.

At least in PropEr, this is a feature requiring support of the property execution function since the results of a property could contain the statistical data. To calculate the statistics the result of a property must be fed into accumulation functions. The accumulator is printed after executing the property.

whatyouhide commented 7 years ago

I'm not sure what a good UI for this would look like. Take this example:

check all list <- list_of(integer(), min_length: 1),
          elem <- member_of(list) do
  assert elem in list
end

How would you print statistics for this? Would you print what was the distribution of list? What could you print about member that would be of interest?

Furthermore, would printing after the property executes be a really good visual interface? You'd have a bunch of dots for the tests you're running, and then this blob of info about a property, then more dots, and so on.

Thoughts?

alfert commented 7 years ago

The kind of relevant statistics depends usually on the properties or the system under test. In your example, I would be interested in the length of the list.

check all list <- list_of(integer(), min_length: 1),
      elem <- member_of(list) do
      assert elem in list 
      aggregate length(list)
end

It shows if the distribution of the list length in order to answer the question if enough and relevant data is generated for tests. If we do the same for maps and we see the generated maps have less then than 32 elements, then this would indicate that relevant parts of the map implementation are not reached (if I remember right, than maps are implemented as 32-trie or similar). It becomes more relevant if we generate recursive data structures. Let us generate a tree with StreamData.tree than I would be interested in the height of the tree, the amount of elements inside and/or the width to asses if the generated tree has desirable properties (e.g. a lefty tree is essentially a list, but a respectable tree at the same time: does it occur in the set of generated trees?). Another prominent situation is for executing state machines for testing stateful systems (which is not (yet?) the scope of stream_data). Here we generate command or event sequences and we are interested in coverage information: are all relevant states / commands / events reached?

Regarding the UI: From my perspective the statistics are primarily required during constructing the properties and fixing the bugs. Often, you will only run one property at a time. In this case, it is more than ok to have the statistics printed after the property log. If you run the properties in a longer test run, they become either as silent as a test log output (I don't want to see all the dots) or they appear in the (CI) test run log output, where they do no harm. In PropEr you can configure your own statistic writer, which may help to do a specific formatting or similar. It might be even useful to generate HTML or Markdown tables as usually done for code coverage reports. What I like in FsCheck für F# is the ability to add an arbitrary (interpolated) string to each collected measure which is used as the bucket name. Here you can use more human readable description of buckets. Examples are in the links below.

Real world examples and blog posts on these features are here:

whatyouhide commented 7 years ago

As I see it right now we have two routes to explore:

StreamData.check_all/3 functional approach

We can make StreamData.check_all/3 aware of this somehow. Since the function passed to check_all now can return {:ok, term}, we can return the property of the generated data we are interested in in term. For example:

check_all(list_of(integer()), [initial_seed: :os.timestamp()], fn list ->
  {:ok, %{aggregations: %{"length of list" => length(list)}}}
end)
#=> {:ok, %{aggregations: %{"length of list" => [{0, "15%"}, ...]}}}

This means that the aggregations can be returned in the final result of check_all/3. The check all macro can then introduce something like aggregate:

check all list <- list_of(integer()) do
  aggregate length(list)
  :ok
end

aggregate could be a macro and use the AST it's passed as the "name" for the aggregation. This solution doesn't look bad but it requires mangling in a few places. However, it has the great benefit of not having to do with property testing (ExUnitProperties specifically) so that you can use it outside of testing to verify that your generators are distributed how you want them to be.

Stateful solution with aggregator

We can have an external aggregator to gather data:

{:ok, aggregator} = Aggregator.start_link()

check all list <- list_of(integer()) do
  Aggregator.aggregate(aggregator, "length of list", length(list))
end

IO.inspect(Aggregator.fetch_aggregated(aggregator))

This has the advantage of being very easy to implement and completely perpendicular to generators and property testing. I also like that it has a clear "debugging" or even "meta" feel about it: after all, users are gonna use aggregated properties to check the goodness of their generators and their tests, so I like that this is more of a "drop-in" tool than the other solution.

Thoughts?

alfert commented 7 years ago

I prefer a functional approach, but without macros, if feasible. If you look at the F# examples (which are quite similar to propcheck and Proper), than we could enrich the last assert with a pipeline of aggregation functions:

check_all(list_of(integer()), [initial_seed: :os.timestamp()], fn list ->
  assert length(list) >= 0
 |> aggregate("length of list", length(list))
 |> classify("trivial", length(list) == 0)
 |> ok() 
end)

aggregate, collect, classify and the like take a either a boolean value (coming from assert) or a pair of a boolean value and list of triples {aggregator_name, label, value}. They return the pair of boolean and triple-list. The ok function simply converts the boolean value intro :ok. BTW, I personally would prefer that check_all would allow for a function to return either {boolean, any} or boolean: a property is either true or false.

Along this road, I would go for this: In check_all the aggregation takes place. The aggregator name/atom is required to find the proper implementation. This allows to expand check_all with aggregation functions defined elsewhere, as aggregation is something like an incremental reducer, which must obey a certain protocol. Printing the stats operate on the list of accumulators of the incremental reducers. It could be helpful to manage that stuff with proper structs, a protocol seems a little bit over the top, since I do not expect more than a handful aggregators at all.

The stateful approach looks nice at first glance and your thoughts about "debugging" is more than true. But to start an explicit aggregation server for such a task feels alien (and imperative). Providing a global server would require to identify each (concurrent) property execution automatically, which could destroy the elegance of your example.

What do you think?

josevalim commented 7 years ago

To be honest, I am not convinced about any. My favorite approach would probably be:

aggregate all list <- list_of(integer()) do
  %{list_length: length(list)
end

but it is unclear how feasible it is in practice to replace the check by aggregate.

Annotating the assertion would be too foreign for elixir unfortunately, since it requires the block to return the assertion and that's generally not how exunit assertions work.

alfert commented 7 years ago

This would not work, if you want to use several aggregations at the same time, which is usually the case when you classify the generated data e.g. into buckets like trivial, below threshold, above threshold.

I do not understand your comment regarding annotating the assertion. What I meant is to take the result of the (executed) assertion (i.e. true or false) and feed it into the aggregators to produce the {:ok, any} return value of check_all or of a property:

property "non negative list length" do
  check all list <- list_of(integer()) do
    assert length(list) >= 0
    |> aggregate("list length", length(list))
  end
end
josevalim commented 7 years ago

This would not work, if you want to use several aggregations at the same time

I see, thanks!

@alfert I do not understand your comment regarding annotating the assertion. What I meant is to take the result of the (executed) assertion (i.e. true or false)

assert raises when it fail, that's why I said it is probably not practical to compose on its return type.

alfert commented 7 years ago

@josevalim assert raises, that is very true. But if not, than it returns true. In other frameworks the result of a property is usually simply a boolean. In StreamData you get the nice error messages from assert which is a big plus IMO. So, what we loose would be stats of the last (i.e. failing) execution. In PropEr you get no stats at all when the property is failing, so I think this would be ok.

The aggregators could simply start from nothing and build a list of aggregations in their pipeline. Same approach, but one parameter less.

whatyouhide commented 7 years ago

@alfert returning true is something I really wish we could avoid if possible. It limits the properties to be written in a boolean fashion instead of naturally with assertions like all other ExUnit tests.

In PropEr you get no stats at all when the property is failing, so I think this would be ok.

Doesn't mean that if we can do something good we can avoid it because other frameworks avoid it :)

In general I think your approach is a bit far from the Elixir language and too close to FsCheck (as far as I know FsCheck). It feels very foreign in Elixir to me.

alfert commented 7 years ago

@whatyouhide Properties are by their very nature boolean statements. But the assert approach is very pragmatic and helpful, in particular for the error messages. Let's stick to it.

Did I get you right that you would prefer an approach like for plug, where the pipeline operator is handled inside the plug macro and is not visible for the programmer, as in:

pipeline :browser do
    plug :accepts, ["html"]
    plug :fetch_session
    plug :fetch_flash
    plug :protect_from_forgery
    plug :put_secure_browser_headers
    plug HelloWeb.Plugs.Locale, "en"
  end

Applying this idea, we could define a stats macro with a similar contract as plug calling statistical functions to calculate a new data point. The magic of stats is to append all these data points into a list or map, which is calculated and printed inside of check_all as discussed earlier. This could look like this:

check_all(list_of(integer()), [initial_seed: :os.timestamp()], fn list ->
  assert length(list) >= 0
  stats aggregate("length of list", length(list))
  stats classify("trivial", length(list) == 0) 
  stats classify("huge list", length(list) > 50)
end)

or perhaps even more in the plug way by using known functions:

check_all(list_of(integer()), [initial_seed: :os.timestamp()], fn list ->
  assert length(list) >= 0
  stats :aggregate, "length of list", length(list)
  stats :classify, "trivial", length(list) == 0
  stats :classify "huge list", length(list) > 50
end)

Optically, we replaced |> by stats, but it makes the intention clearer. Does this feel better?

whatyouhide commented 7 years ago

I meant to use an aggregate macro (maybe also classify but that's not the point right now) that would simply collect the data behind the scenes without the need of either stats (why have it?) nor the need to "hide a pipeline". There would be no pipeline and we wouldn't be augmenting the result of assertions, we would just be returning the aggregations from the check_all we compile to, just like in the check_all example I showed above.

alfert commented 7 years ago

@whatyouhide Could you please rewrite my last example in your way? I am not sure how I would have to formulate it properly. Perhaps I am thinking in a complete wrong direction. Thanks!

whatyouhide commented 7 years ago

Sorry I thought you were talking about check all and not StreamData.check_all/3. StreamData.check_all/3 would be written like I did originally, returning the aggregations in a %{aggregations: ...} map. You would do your calculations manually which is very fine IMO considering check_all/3 is not really supposed to be user-friendly but more a flexible foundation for check all. As far as check all goes, you would do:

check all list <- list_of(integer()) do
  aggregate "length of list", length(list)
  classify "trivial", length(list) == 0
  classify "huge list", length(list) > 50
  assert length(list) >= 0
end

or something of the sorts.

alfert commented 7 years ago

Ok, so if the property fails we have also the stats for failing property run. This is good!

I think that aggregate and classify are macros, otherwise they have no access to the map to be returned. Am I right?

whatyouhide commented 7 years ago

Yes they have to be macros. For what it's worth, assert can be anywhere in there (doesn't have to be the last expression) since we don't use its return value in any way in check all, we only use the fact that assert raised or not.

alfert commented 7 years ago

Correct, but when assert raises, the following commands are not executed. So, unless you reorder the statements in the property as part of the implementation , their order is significant. But's it is a good style to collect first the stats and then do the checking. The classical implementation strategy that stats collection is a wrapper of the property does not apply here.

whatyouhide commented 7 years ago

Correct, but when assert raises, the following commands are not executed

Right, I was missing that, sorry. So yeah we might have to do the aggregations as soon as possible in the test.

tmbb commented 7 years ago

Right, I was missing that, sorry. So yeah we might have to do the aggregations as soon as possible in the test.

If you want to collect them as soon as possible, you can collect stats before the do block. Like this:

property "something" do
    check all x <- gen1(), y <- gen2(), z <- gen3(),
      aggregate: [
        key1: stat1(x),
        key2: stat2(x),
        key3: stat3(y)
      ] do
      assert_something(x, y, z)
    end
  end

You get a nice AST out of this from this of course:

{:property, [],
 ["something",
  [do: {:check, [],
    [{:all, [],
      [{:<-, [], [{:x, [], Elixir}, {:gen1, [], []}]},
       {:<-, [], [{:y, [], Elixir}, {:gen2, [], []}]},
       {:<-, [], [{:z, [], Elixir}, {:gen3, [], []}]},
       [aggregate: [key1: {:stat1, [], [{:x, [], Elixir}]},
         key2: {:stat2, [], [{:x, [], Elixir}]},
         key3: {:stat3, [], [{:y, [], Elixir}]}]]]},
     [do: {:assert_something, [],
       [{:x, [], Elixir}, {:y, [], Elixir}, {:z, [], Elixir}]}]]}]]}

The main problem here is that it requires you to gather the statistics on the "raw" values generated by the generators. You might be more interested in some intermediate values. That's what I believe to be the hardest problem, collecting statistics as late as possible.

One way to gather statistics whenever you want is to gather them in the process dictionary and get them from there after the test case is run. If you use the process dictionary you don't even need macros; functions would be good enough.

It's probably better than to gather them in a different process, at least in this case. For raw generators (not properties), you can just tell the user to wrap it in a with_statistics function or macro that cleans up the values from the process dictionary and gives them back to the user.

edouardmenayde commented 6 years ago

Is there some progress on this ? If not can I or someone else help make this a thing ?