ybisk / charNMT-noise

Scripts and noise data for Belinkov & Bisk 2018

The meaning of distribution #2

Closed sh0416 closed 1 year ago

sh0416 commented 3 years ago

Hi,

I am wondering what "distribution" means in the config. I read the code in "scrambler.py" to understand how it works, but I couldn't figure it out. The following list is what I know.

Could you give me some insight into how to operate that code? Thanks,

sh0416 commented 3 years ago

I noticed that the branch decides whether the noise operations are sampled with or without replacement.

In that case, which configuration of "distribution" was used in the experiments?

ybisk commented 3 years ago

Hi! Sorry, you're pushing the bounds of my memory :) I believe the logic is roughly the following.

  1. If the distribution field is a valid distribution whose values sum to 1, then for each sentence we sample one of the scrambling methods to apply.
  2. If the values in the array sum to greater than 1, then each value is interpreted as a probability per scrambling type, so the same sentence is appended multiple times, with each entry having some scrambling applied with the corresponding probability. This means you end up with more output sentences than input.
  3. If the values sum to less than 1, then we are only applying randomness to a subset of the data, so we apply the corresponding combination of scrambles up to a limit.
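
As a rough illustration of the three cases above, here is a minimal Python sketch. It is not the code from scrambler.py and has not been checked against it. The names `classify`, `apply_noise`, `upper` and `reverse` are hypothetical, the two toy scramblers stand in for the real noise functions, and the handling of the "less than 1" case (leave a sentence untouched with the leftover probability) is only one possible reading of "a subset of the data".

```python
import math
import random


def upper(sentence):
    # Toy stand-in for a real scrambling method (hypothetical).
    return sentence.upper()


def reverse(sentence):
    # Second toy stand-in for a scrambling method (hypothetical).
    return sentence[::-1]


def classify(dist):
    # Decide which of the three cases applies, based on the sum of the values.
    total = sum(dist)
    if math.isclose(total, 1.0, abs_tol=1e-9):
        return "sample_one"
    return "per_type_copies" if total > 1.0 else "subset"


def apply_noise(sentences, scramblers, dist, rng):
    mode = classify(dist)
    out = []
    for sent in sentences:
        if mode == "sample_one":
            # Case 1: pick exactly one scrambler per sentence, weighted by dist.
            fn = rng.choices(scramblers, weights=dist)[0]
            out.append(fn(sent))
        elif mode == "per_type_copies":
            # Case 2: one output copy per scrambler, each noised with its own
            # probability, so the output is longer than the input.
            for fn, p in zip(scramblers, dist):
                out.append(fn(sent) if rng.random() < p else sent)
        else:
            # Case 3 (assumed reading): the leftover probability mass means
            # "leave this sentence unchanged".
            options = list(scramblers) + [lambda s: s]
            weights = list(dist) + [1.0 - sum(dist)]
            fn = rng.choices(options, weights=weights)[0]
            out.append(fn(sent))
    return out
```

Under these assumptions, `apply_noise(sentences, [upper, reverse], [0.5, 0.5], random.Random(0))` returns one output per input sentence, while a distribution such as `[0.6, 0.6]` returns two outputs per input sentence.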

This is based on a little bit of memory and a little bit of reading, but I probably would need to rerun some of the scripts to guarantee I got this behavior right.

Hope that helps!