whatyouhide / stream_data

Data generation and property-based testing for Elixir. 🔮
https://hexdocs.pm/stream_data

How to pass generators without materializing them #152

Closed TylerPachal closed 3 years ago

TylerPachal commented 3 years ago

I have the following generator for creating English-like text:

def text() do
  characters =
    frequency([
      {4, choose(~w(b c d f g h j k l m n p q r s t v w x z))},
      {1, choose(~w(a e i o u y))}
    ])

  list_of(frequency([
    {50, characters},
    {5, constant("\s")},
    {5, choose([". ", "-", "! ", "? ", ", "])},
    {5, string(?0..?9, length: 1)},
    {1, constant("\n")}
  ]), min_length: 1)
  |> map(&Enum.join/1)
end

It works fine, but now I want to introduce an extra "layer" so that I might get Japanese text instead. I just want to make the characters variable, but I cannot quite figure out how. Naively I did the following:

def text() do
  english_characters =
    frequency([
      {4, choose(~w(b c d f g h j k l m n p q r s t v w x z))},
      {1, choose(~w(a e i o u y))}
    ])

  japanese_characters =
    integer(12352..12543)
    |> map(fn cp -> List.to_string([cp]) end)

  characters = 
    frequency([
      {9, english_characters},
      {1, japanese_characters}
    ])

  list_of(frequency([
    {50, characters},
    {5, constant("\s")},
    {5, choose([". ", "-", "! ", "? ", ", "])},
    {5, string(?0..?9, length: 1)},
    {1, constant("\n")}
  ]), min_length: 1)
  |> map(&Enum.join/1)
end

This kind of works, but I do not want English combined with Japanese; I want all English, or all Japanese.

Is it possible to use frequency() to return a generator without materializing it? Or perhaps do I need to use bind? I also tried wrapping the english_characters and japanese_characters in anonymous functions but that did not quite work either.

TylerPachal commented 3 years ago

This seems to work:

def text() do
  english_characters =
    frequency([
      {4, choose(~w(b c d f g h j k l m n p q r s t v w x z))},
      {1, choose(~w(a e i o u y))}
    ])

  japanese_characters =
    integer(12352..12543)
    |> map(fn cp -> List.to_string([cp]) end)

  frequency([
    {9, text(english_characters)},
    {1, text(japanese_characters)}
  ])
end

defp text(character_gen) do
  list_of(frequency([
    {50, character_gen},
    {5, constant("\s")},
    {5, choose([". ", "-", "! ", "? ", ", "])},
    {5, string(?0..?9, length: 1)},
    {1, constant("\n")}
  ]), min_length: 1)
  |> map(&Enum.join/1)
end

Is that the best way to achieve this?

whatyouhide commented 3 years ago

Yes, the second way is what you want, because you want a list built from characters, punctuation, and spaces drawn from a fixed character set. In your first attempt, every time a character is picked it will be either English or Japanese, so yes, they'd be mixed.
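
For the `bind` question raised earlier in the thread, here is a minimal sketch of an equivalent approach: pick a language tag once, then build the whole string from that language's character generator. The `TextGen` module, the `characters/1` helper, and the `:english`/`:japanese` tags are hypothetical names introduced for illustration. The sketch uses StreamData's `member_of/1` where the issue's code calls `choose`, on the assumption that `choose` is a local helper that picks one element of a list, and it assumes `stream_data` 1.x can be fetched with `Mix.install`.

```elixir
Mix.install([{:stream_data, "~> 1.0"}])

defmodule TextGen do
  import StreamData

  # Hypothetical helper: a single-character generator per language tag.
  def characters(:english) do
    frequency([
      {4, member_of(~w(b c d f g h j k l m n p q r s t v w x z))},
      {1, member_of(~w(a e i o u y))}
    ])
  end

  def characters(:japanese) do
    # Same code point range as in the issue (12352..12543).
    map(integer(12352..12543), fn cp -> List.to_string([cp]) end)
  end

  # Builds a non-empty string whose letters all come from `character_gen`.
  def text(character_gen) do
    frequency([
      {50, character_gen},
      {5, constant(" ")},
      {5, member_of([". ", "-", "! ", "? ", ", "])},
      {5, string(?0..?9, length: 1)},
      {1, constant("\n")}
    ])
    |> list_of(min_length: 1)
    |> map(&Enum.join/1)
  end

  # The language is generated once per string, so the character set is
  # fixed for the whole string instead of being re-picked per character.
  def text_via_bind do
    frequency([{9, constant(:english)}, {1, constant(:japanese)}])
    |> bind(fn lang -> text(characters(lang)) end)
  end
end

# Print a few samples; each one should be all English or all Japanese.
TextGen.text_via_bind() |> Enum.take(5) |> Enum.each(&IO.inspect/1)
```

This should behave the same as putting `text(english_characters)` and `text(japanese_characters)` directly inside `frequency/1`. The `bind/2` form is mostly useful when the choice made first needs to feed into more than one later generator.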