whatyouhide / stream_data

Data generation and property-based testing for Elixir. 🔮
https://hexdocs.pm/stream_data

Generators that generate unique values #173

Open zachdaniel opened 2 years ago

zachdaniel commented 2 years ago

At the moment, I see uniq_list_of/2 that allows for taking a generator and producing a unique list of items from that generator. But is there anything like maps |> uniq_by(&Map.take(&1, [:a, :b])) |> ... that ensures that no duplicate values are ever produced? It would need to also not reuse any values for shrinking, so perhaps would be unshrinkable (I'm out of my depth there).

What I'm looking for:

%{a: integer(), b: integer(), c: integer()}
|> fixed_map()
|> StreamData.uniq_by(&Map.take(&1, [:a, :b]))

The best I can find at the moment

%{a: integer(), b: integer(), c: integer()}
|> fixed_map()
|> StreamData.uniq_list_of(uniq_fun: &Map.take(&1, [:a, :b]))

The latter does not compose the way that I'm looking to compose the generator, as it produces lists.

whatyouhide commented 2 years ago

There's a big difference between uniq_list_of/2 and what you are looking for.

uniq_list_of/2 is a generator that generates list values where each list is "unique", that is, each generated list contains zero duplicated elements. Generated lists are not unique "across lists": the generator could generate [1, 2, 3] and then [1, 2, 3] again, and each of those is still a valid unique list.

What you are looking for is a way for a generator to generate unique values across the generated values. It's a completely different implementation, as we need to keep a set of "seen" values when generating.

Since generators inherently produce random data, this is not easy to achieve. We can't deterministically generate a next unique value that hasn't been generated before: we have to generate a candidate, check whether it's in the "seen" set, re-generate if it is, and so on.
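The generate-check-regenerate loop described above can be sketched in plain Elixir, leaving aside StreamData's lazy-tree machinery. Everything here (`UniqGen`, the zero-arity `gen` function, the retry cap) is illustrative, not part of the library:

```elixir
defmodule UniqGen do
  # Wraps a plain zero-arity generator function in a stream that never
  # repeats a value, as judged by `key_fun` (like Enum.uniq_by/2).
  # Gives up and ends the stream after `max_retries` consecutive duplicates,
  # mirroring how a filter-based generator must bail out eventually.
  def stream(gen, key_fun \\ & &1, max_retries \\ 100) do
    Stream.unfold(MapSet.new(), fn seen ->
      next_unique(gen, key_fun, seen, max_retries)
    end)
  end

  defp next_unique(_gen, _key_fun, _seen, 0), do: nil

  defp next_unique(gen, key_fun, seen, retries) do
    value = gen.()
    key = key_fun.(value)

    if MapSet.member?(seen, key) do
      # Already seen: re-generate, burning one retry.
      next_unique(gen, key_fun, seen, retries - 1)
    else
      {value, MapSet.put(seen, key)}
    end
  end
end
```

A real implementation inside StreamData would additionally have to thread the seen set through the lazy trees used for shrinking, which is where the difficulty mentioned below comes from.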

I'd love for you to take a stab at this if you want, and I'm happy to provide assistance.

When it comes to "shrinkability", you're probably right: this would end up being an unshrinkable generator. Making sure that the whole lazy tree (the data structure we use for shrinking) is unique might be a challenge. 🙃

zachdaniel commented 2 years ago

Yeah, definitely some interesting issues here, especially considering that uniq_list_of/2 is really just filtering a generator. So for some context here, what I'm doing is adding StreamData generators that produce valid input to an Ash Framework resource action. It isn't a hard requirement that each value returned by that stream can be used within the same transaction, but it would be very nice if it could be. Ash is "aware" of the various unique constraints on your resource, like email must be unique, and foo and bar must be unique together.

So here is what I'm wondering: could we potentially build in some concept of a "unique generator"? Primarily because generating unique values for different types could involve different things, and they might still be shrinkable in some way. An example that comes to mind: values could be "skipped" along the way (i.e. shorter strings, lower integers) and reserved for shrinking, unless the stream runs out or something, in which case they'd need to be used for generating more values.

Ash could use this under the hood by just ensuring that (in the above example) email comes from unique_string() and foo and bar come from unique_map(...).

Here is the kind of thing I have in mind:

def unique_integer(left..right = _range) do
  # An implementation that does something like `System.unique_integer([:positive, :monotonic])`,
  # so that it only produces increasing values
  new(...., unique: true)
end
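For reference, `System.unique_integer/1` is an existing Elixir/OTP function that already provides the "only increasing values" property the sketch relies on (how it would be wired into a generator is the open question):

```elixir
# System.unique_integer/1 returns integers that are unique within the
# current runtime; with the :monotonic modifier, successive calls are
# strictly increasing, so no value can ever repeat.
a = System.unique_integer([:positive, :monotonic])
b = System.unique_integer([:positive, :monotonic])
true = b > a and a > 0
```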

We could store on the stream that it is capable of producing unique values each time. And then we could have unique streams only compose with other unique streams, i.e

unique_map(%{id: unique_integer()})

As I'm typing this I'm honestly not sure if it's just a crazy-bad idea or what 😆. unique_map may also be a bit strange... I guess all it would really do is require that its member generators are unique generators, so there's possibly no point in having it?

I'm not suggesting this is the solution I need/want, but in my (somewhat limited) experience with property testing, avoiding filters drastically improves your quality of life. Using a filter to get unique results would, I imagine, be a pretty big performance hit and would make "too many filtered values" errors significantly more common.
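To put a rough number on that filtering cost (my arithmetic, not anything from StreamData): if the filter rejects every already-seen value, the expected number of attempts per fresh value follows a geometric distribution, and it blows up as the domain gets exhausted.

```elixir
# If k of n possible values are already "seen", each attempt succeeds with
# probability (n - k)/n, so the expected number of attempts to find a fresh
# value is n / (n - k).
expected_attempts = fn n, k -> n / (n - k) end

1.0 = expected_attempts.(100, 0)     # first value is free
10.0 = expected_attempts.(100, 90)   # 10 draws on average near exhaustion
100.0 = expected_attempts.(100, 99)  # last value costs ~100 draws
```

This is exactly the regime where a fixed "max consecutive failures" cap on a filter starts raising errors, which is the quality-of-life problem described above.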