whatyouhide / stream_data

Data generation and property-based testing for Elixir. 🔮
https://hexdocs.pm/stream_data

Improve the float generator #18

Closed (tmbb closed this issue 6 years ago)

tmbb commented 7 years ago

I've never used floats on the BEAM, and I can't find the official reference. In any case, my opinion on generating floats (feel free to correct me if this doesn't make sense on the BEAM):

  1. There should be a float generator that can generate any float supported by the BEAM (even in the large-magnitude region where every representable float is an integer).
  2. Floats should shrink towards positive zero (the float represented by the bits <<0::size(64)>>). Currently floats don't shrink at all. I think StreamData should try to shrink almost everything, either based on the value being shrunk (where that makes sense) or based on its bit representation (where that makes sense).
  3. The values 0.0, -0.0, the maximum finite float, and the minimum finite float should be tested every time, unless the user chooses not to. The BEAM doesn't seem to support NaN; otherwise it should be generated too. While for lists I can accept that the empty list merely has a high likelihood of being tested, I think testing these cases every time is important for floats. The user should have to explicitly opt out of these "pathological" cases (see the sketch after this list).
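For reference, here is a minimal sketch of those special doubles built directly from their IEEE 754 bit patterns (plain bit syntax; none of this is StreamData API):

<<pos_zero::float>> = <<0::1, 0::11, 0::52>>   #=> 0.0
<<neg_zero::float>> = <<1::1, 0::11, 0::52>>   #=> -0.0
# Largest finite doubles: exponent 2046, mantissa all ones.
<<max_float::float>> = <<0::1, 2046::11, 0xFFFFFFFFFFFFF::52>>  #=> 1.7976931348623157e308
<<min_float::float>> = <<1::1, 2046::11, 0xFFFFFFFFFFFFF::52>>  #=> -1.7976931348623157e308
# A NaN bit pattern (exponent all ones) can't even be matched as a float,
# consistent with the BEAM not supporting NaN:
# <<nan::float>> = <<0::1, 2047::11, 1::52>>   #=> ** (MatchError)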
whatyouhide commented 7 years ago

As for points 1 and 2, I went with uniform_float/0 because it solves a lot of problems just by being composable with other generators. For example, to get a float generator that also produces values outside of [0, 1], you can compose it with the int/0 generator:

StreamData.map(StreamData.tuple({StreamData.uniform_float(), StreamData.int()}), fn {float, int} -> float * int end)
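(Since generators implement the Enumerable protocol, you can eyeball the output of a composed generator like this with something like Enum.take(generator, 5); early samples tend to be small because the generation size starts small.)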

Values generated by the generator above will also shrink towards 0.0 because int/0 shrinks towards 0. I think that shrinking for floats is a hard problem because I'm not sure how you would shrink: that is, do you think 0.1 is simpler than 0.00001234903490384937498, even if the latter is "smaller" (<)?

As for point 3, I'm still not sure how to approach it, but I'll think about it. It's a more general problem than just floats.

tmbb commented 7 years ago

I think that shrinking for floats is a hard problem because I'm not sure how you would shrink: that is, do you think 0.1 is simpler than 0.00001234903490384937498, even if the latter is "smaller" (<)?

Well, I think shrinking should be towards zero, for two reasons:

  1. Zero is the "simplest" real number
  2. The bitstring representation of zero is <<0.0::float>> = <<0, 0, 0, 0, 0, 0, 0, 0>>. From the point of view of the computer, this is the simplest value.

Thinking about how to shrink a float is harder. The way I'd do it would be by shrinking the bitstring representation lexicographically, following the IEEE 754 double-precision layout:

[Figure from Wikipedia: IEEE 754 double precision; bit 63 is the sign, bits 62-52 are the exponent, bits 51-0 are the fraction]

Clearing bit 63 (from 1 to 0) switches the number from negative to positive without changing the absolute value. Clearing bits left to right from bit 62 onward decreases the absolute value of the number. I'd clear bits left to right, keeping each change only if the property still fails. Then I'd try a mix of right-to-left shrinking with some heuristics and maybe some backtracking. Other strategies are shifting bits, rotating bit segments, etc. Hypothesis uses a mix of these strategies to shrink byte streams.

It would be a mix between binary search and some exploration of the search space, with the advantage that shrinking the bitstring representation lexicographically also decreases the float's absolute value.
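To make that concrete, here is a minimal sketch of the left-to-right pass; still_fails? is a hypothetical predicate that re-runs the property on a candidate, and none of this is StreamData code:

import Bitwise

shrink = fn float, still_fails? ->
  <<bits::64>> = <<float::float>>

  shrunk =
    Enum.reduce(63..0//-1, bits, fn bit, acc ->
      # Try clearing one bit, MSB first; keep the change only if the
      # property still fails on the resulting float.
      candidate = acc &&& bnot(1 <<< bit)
      # Clearing bits of a finite double can never produce an all-ones
      # exponent, so this match can't hit NaN/infinity on the BEAM.
      <<candidate_float::float>> = <<candidate::64>>
      if candidate != acc and still_fails?.(candidate_float),
        do: candidate,
        else: acc
    end)

  <<result::float>> = <<shrunk::64>>
  result
end

# Usage: greedily search for a "simpler" float that still fails, e.g.
# shrink.(123.456, fn float -> float > 100 end)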

So in my opinion, 0.00001234903490384937498 is simpler, because it has a lower absolute value, and is lexicographically smaller:

iex> <<0.1::float>>
<<63, 185, 153, 153, 153, 153, 153, 154>>
iex> <<0.00001234903490384937498::float>>
<<62, 233, 229, 214, 110, 254, 184, 252>>
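Binaries compare byte by byte in Erlang term order, so the lexicographic claim can be checked directly:

iex> <<0.00001234903490384937498::float>> < <<0.1::float>>
true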
tmbb commented 7 years ago

Also, from the point of view of floating point numbers, 0.1 is not simple at all. On the other hand, 0.5 is (it's a power of two). So actually, I think you have three criteria to decide which number is the simplest:

  1. By lexicographically comparing the bitstrings. Roughly, this compares by absolute value, with the positive number winning ties (strictly, byte order sorts every positive float before every negative one, and tracks absolute value within each sign).
  2. By comparing the Hamming weights of the numbers (the number of 1 bits in the binary representation), breaking ties lexicographically. By this criterion, 0.5 is simpler than 0.1: 0.5 is exactly representable and has far fewer set bits (see the sketch below).
  3. By doing something funny like looking for the number that is "simpler" when represented in base 10. There might be a way of defining this rigorously using continued fractions or something like that, but it would be highly unintuitive, and I don't think you should go this way.

I think the first criterion is the one that feels most natural, but the second one also makes sense. It's probably easier to shrink according to the second criterion: you only have to unset bits (you never have to flip a 0 to a 1) or move a 1 rightwards, so it's a smaller search space.
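As a concrete reading of criterion 2, here is a small sketch (hypothetical helpers, not StreamData API) that orders floats by Hamming weight with a lexicographic tie-break:

# Count the 1 bits in a float's 64-bit representation.
hamming_weight = fn float ->
  bits = <<float::float>>
  for <<bit::1 <- bits>>, reduce: 0 do
    acc -> acc + bit
  end
end

# Criterion 2: fewer set bits wins; ties fall back to byte order.
simpler? = fn a, b ->
  {hamming_weight.(a), <<a::float>>} <= {hamming_weight.(b), <<b::float>>}
end

simpler?.(0.5, 0.1)  #=> true: 0.5 has 9 set bits, 0.1 has 35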

josevalim commented 7 years ago

Yup! I agree shrinking the underlying bytes is a good approach for floats.