bojone closed this issue 4 years ago
If I understand correctly, you are referring to the self-BLEU. Actually, I opened an issue #27 on Texygen about the self-BLEU metric.
No, it is not Self-BLEU.
The BLEU in your work is something like
np.mean([
    bleu(references=the_whole_test_data, hypothesis=s)
    for s in the_whole_generated_data
])
It can serve as a metric of how realistic the generated text is.
My idea is to calculate
np.mean([
    bleu(references=the_whole_generated_data, hypothesis=s)
    for s in the_whole_test_data
])
as a metric of generated diversity: a high score means that everything in the_whole_test_data can be found in the_whole_generated_data.
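To make the two directions concrete, here is a minimal sketch. It assumes a simplified stand-in for bleu() (the geometric mean of clipped n-gram precisions, with no brevity penalty and no smoothing), so the numbers will not match a real BLEU implementation. The names simple_bleu and mean_bleu and the toy sentences are hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import exp, log


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(references, hypothesis, max_n=2):
    """Simplified stand-in for bleu(): geometric mean of clipped
    n-gram precisions. No brevity penalty, no smoothing."""
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hypothesis, n)
        total = sum(hyp_counts.values())
        if total == 0:
            return 0.0
        # Clip each n-gram count by its maximum count in any single reference.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        matched = sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
        if matched == 0:
            return 0.0
        log_precisions.append(log(matched / total))
    return exp(sum(log_precisions) / max_n)


def mean_bleu(references, hypotheses, max_n=2):
    """Average score of every hypothesis against the full reference set."""
    scores = [simple_bleu(references, h, max_n) for h in hypotheses]
    return sum(scores) / len(scores)


# Toy data: the "generator" has collapsed onto a single test sentence.
test_data = [s.split() for s in
             ["a man rides a horse", "a dog runs on grass", "two kids play ball"]]
generated = [s.split() for s in
             ["a man rides a horse", "a man rides a horse"]]

# Direction used in the paper: generated sentences scored against test data.
forward = mean_bleu(references=test_data, hypotheses=generated)
# Proposed direction: test sentences scored against generated data.
reverse = mean_bleu(references=generated, hypotheses=test_data)
print(forward, reverse)
```

On this toy data the forward score is high because every generated sentence appears in the test set, while the reverse score is low because most test sentences have no counterpart in the generated set.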
Thanks for the explanation, now I see your point. I guess what you have proposed is basically the same as BLEU, since the function bleu() in our case actually calculates the mean of all the BLEU scores between each reference and hypothesis, and you just swap the order of the two for loops.
Approximately, the original one checks whether the_whole_generated_data is a subset of the_whole_test_data. My idea is to check whether the_whole_test_data is a subset of the_whole_generated_data.
If both of them are high, it means the_whole_generated_data ⊆ the_whole_test_data and the_whole_test_data ⊆ the_whole_generated_data, indicating the_whole_test_data = the_whole_generated_data.
I have computed Self-BLEU in a way that ensures the test data and the reference data are the same. I guess that issue #27 on Texygen does not affect me, because I do not reuse the saved "references" in the SelfBleu class.
For COCO, I saved 1,000 sentences and computed Self-BLEU-2 at each epoch. After pretraining, Self-BLEU-2 was around 0.76. After adversarial training for about 10 epochs (3130 iters), Self-BLEU-2 rose to about 0.85.
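The code behind these numbers is not shown in this thread. Purely as an illustration, a leave-one-out Self-BLEU loop might look like the sketch below. The scorer unigram_precision is a hypothetical placeholder (clipped unigram precision), not BLEU-2, and self_bleu is not the Texygen SelfBleu class; any function with the signature score_fn(references, hypothesis) could be swapped in.

```python
from collections import Counter


def unigram_precision(references, hypothesis):
    """Hypothetical placeholder scorer: clipped unigram precision.
    This is NOT full BLEU-2; it only keeps the example short."""
    hyp_counts = Counter(hypothesis)
    max_ref = Counter()
    for ref in references:
        for tok, count in Counter(ref).items():
            max_ref[tok] = max(max_ref[tok], count)
    matched = sum(min(c, max_ref[t]) for t, c in hyp_counts.items())
    return matched / max(len(hypothesis), 1)


def self_bleu(sentences, score_fn=unigram_precision):
    """Score each sentence against all the other sentences, then average."""
    scores = []
    for i, hyp in enumerate(sentences):
        # Leave the hypothesis itself out of its own reference set.
        others = sentences[:i] + sentences[i + 1:]
        scores.append(score_fn(others, hyp))
    return sum(scores) / len(scores)


collapsed = [s.split() for s in ["a man rides a horse"] * 3]
varied = [s.split() for s in
          ["a man rides a horse", "two kids play ball", "the cat sleeps"]]

print(self_bleu(collapsed), self_bleu(varied))
```

A higher value means the saved sentences resemble each other more, which is why a rising Self-BLEU is usually read as a loss of diversity.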
Hmm, this is interesting. Could you please share your code to calculate the self-BLEU score? Thanks!
In your article, you use the whole test data as the reference and then calculate the BLEU of each generated sentence. The average of these scores can serve as a metric of how realistic the generated text is.
Conversely, why not use the whole generated data (the same number of sentences as the test data) as the reference and then calculate the BLEU of each test sentence? The average of these scores can serve as a metric of generated diversity.