my idea of metric on diversity.

weilinie / RelGAN

Implementation of RelGAN: Relational Generative Adversarial Networks for Text Generation

MIT License


my idea of metric on diversity. #6

Closed: bojone closed this issue 4 years ago

bojone commented 5 years ago

In your paper, you use the whole test set as the reference and then calculate the BLEU of each generated sentence. The average of these scores can serve as a metric of how realistic the generated text is.

Conversely, why not use the whole generated set (with the same number of sentences as the test set) as the reference and then calculate the BLEU of each test sentence? The average of these scores can serve as a metric of how diverse the generated text is.

weilinie commented 5 years ago

If I understand correctly, you are referring to the self-BLEU. Actually, I opened an issue #27 on Texygen about the self-BLEU metric.

bojone commented 5 years ago

No, it is not Self-BLEU.

The BLEU in your work is something like

np.mean([
    bleu(references=the_whole_test_data, hypothesis=s)
    for s in the_whole_generated_data
])

It can serve as a metric of how realistic the generated text is.

My idea is to calculate

np.mean([
    bleu(references=the_whole_generated_data, hypothesis=s)
    for s in the_whole_test_data
])

as a metric of generation diversity, where a high score means that (approximately) all of the_whole_test_data can be found in the_whole_generated_data.
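To make the two directions concrete, here is a minimal, standard-library-only sketch. It is not the RelGAN or Texygen implementation: `bleu` below is a plain unsmoothed multi-reference BLEU written for illustration, and the names `forward_bleu` and `reverse_bleu` are hypothetical labels for the two averages described above. Sentences are assumed to be lists of tokens.

```python
import math
from collections import Counter
from statistics import mean


def ngrams(tokens, n):
    """Count the n-grams in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(references, hypothesis, max_n=2):
    """Unsmoothed multi-reference BLEU with uniform weights (illustrative only)."""
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hypothesis, n)
        if not hyp_counts:
            return 0.0
        # Clip each hypothesis n-gram by its maximum count in any single reference.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
        if clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / sum(hyp_counts.values())))
    # Brevity penalty against the reference length closest to the hypothesis length.
    hyp_len = len(hypothesis)
    ref_len = min((len(r) for r in references),
                  key=lambda length: (abs(length - hyp_len), length))
    penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return penalty * math.exp(sum(log_precisions) / max_n)


def forward_bleu(test_data, generated_data, max_n=2):
    """Test set as references, generated sentences as hypotheses (realism)."""
    return mean([bleu(test_data, s, max_n) for s in generated_data])


def reverse_bleu(test_data, generated_data, max_n=2):
    """Generated set as references, test sentences as hypotheses (diversity)."""
    return mean([bleu(generated_data, s, max_n) for s in test_data])


# Toy data: the generator only ever produces one of the two test sentences.
the_whole_test_data = [["a", "cat", "sat"], ["a", "dog", "ran"]]
the_whole_generated_data = [["a", "cat", "sat"], ["a", "cat", "sat"]]

print(forward_bleu(the_whole_test_data, the_whole_generated_data))  # 1.0
print(reverse_bleu(the_whole_test_data, the_whole_generated_data))  # 0.5
```

On this toy data every generated sentence is matched perfectly by the test set, but only half of the test sentences are matched by the generated set, so the second average drops. The sketch rebuilds the reference counts for every hypothesis, which is fine for a demonstration but slow on real corpora.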

weilinie commented 5 years ago

Thanks for the explanation and now I see your point. I guess what you have proposed is basically the same with bleu, since the func bleu() in our case actually calculates the mean of all the bleu scores between each reference and hypothesis, and you just swap the order of two for loops.

bojone commented 5 years ago

Approximately, the original metric checks whether the_whole_generated_data is a subset of the_whole_test_data, and my idea is to check whether the_whole_test_data is a subset of the_whole_generated_data.

If both of them are high, it means the_whole_generated_data ⊆ the_whole_test_data and the_whole_test_data ⊆ the_whole_generated_data, indicating the_whole_test_data = the_whole_generated_data.
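The subset reading can be illustrated with a deliberately simplified stand-in for BLEU. The hypothetical helper `coverage(source, target)` below only measures the average fraction of each source sentence's bigrams that appear anywhere in the target set; it has no clipping and no brevity penalty, so it is a rough proxy under that assumption, not the metric used in the paper.

```python
def bigrams(sentence):
    """Return the set of adjacent token pairs in a tokenized sentence."""
    return set(zip(sentence, sentence[1:]))


def coverage(source, target):
    """Average fraction of each source sentence's bigrams found anywhere in target."""
    target_bigrams = set()
    for sentence in target:
        target_bigrams |= bigrams(sentence)
    scores = []
    for sentence in source:
        grams = bigrams(sentence)
        if grams:  # skip sentences too short to contain a bigram
            scores.append(len(grams & target_bigrams) / len(grams))
    return sum(scores) / len(scores) if scores else 0.0


test_data = [["the", "cat", "sat"], ["the", "dog", "ran"], ["a", "bird", "flew"]]

# A collapsed generator that repeats a single test sentence.
collapsed = [["the", "cat", "sat"]] * 3

# "generated is a subset of test" holds, "test is a subset of generated" does not.
print(coverage(collapsed, test_data))  # 1.0
print(coverage(test_data, collapsed))  # one third of the test set is covered

# When the generated set equals the test set, both directions are high.
print(coverage(test_data, test_data))  # 1.0
```

Only when both directions score high do the two sets approximately coincide, which is the point of the two-sided check.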

chenwq95 commented 5 years ago

I have computed Self-BLEU in a way that ensures the test data and the reference data are the same. I guess that issue #27 on Texygen does not affect me, because I do not reuse the saved "references" in the SelfBleu class.

For COCO, I saved 1,000 sentences and computed Self-BLEU-2 at each epoch. After pretraining, Self-BLEU-2 was around 0.76. After adversarial training for about 10 epochs (3130 iters), Self-BLEU-2 rose to about 0.85.

weilinie commented 5 years ago

Hmm, this is interesting. Could you please share your code to calculate the self-BLEU score? Thanks!