yg211 / summary-reward-no-reference

A reference-free metric for measuring summary quality, learned from human ratings.
https://arxiv.org/abs/1909.01214
Apache License 2.0

How to interpret the values #1

Open MichiOnGithub opened 5 years ago

MichiOnGithub commented 5 years ago

Thank you for this great contribution; I'm sure it will help in developing RL summarization systems.

One thing I don't understand is how to interpret the values returned by the rewarder. I'd assume that higher scores indicate higher-quality summaries, but running a few tests, the values are not what I expected:

    import os
    from rewarder import Rewarder  # Rewarder class from rewarder.py in this repo

    rewarder = Rewarder(os.path.join('trained_models', 'sample.model'))

    doc = '''Bilbo was very rich and very peculiar, and had been the wonder of the Shire for sixty years, ever since his remarkable disappearance and unexpected return. The riches he had brought back from his travels had now become a local legend, and it was popularly believed, whatever the old folk might say, that the Hill at Bag End was full of tunnels stuffed with treasure. And if that was not enough for fame, there was also his prolonged vigour to marvel at. Time wore on, but it seemed to have little effect on Mr. Baggins. At ninety he was much the same as at fifty.'''

    summ1 = '''Bilbo was very rich and at age ninety as vigorous as at fifty.'''
    summ2 = '''Bilbo was very wealthy and peculiar, the riches he brought back from his journey made him a local legend. He was also very vigorous for his age.'''
    summ3 = '''The lord of Bag End is called Bilbo the Mighty and he is known for ruling the Shire with an iron fist.'''
    summ4 = '''Last weekend, a man died after a car crash.'''

    print(
        rewarder(doc, summ1),
        rewarder(doc, summ2),
        rewarder(doc, summ3),
        rewarder(doc, summ4)
    )

Output: -1.828371 -0.8733603 -0.02868136 -0.747489

Am I using it incorrectly or do I need to apply any kind of preprocessing beforehand? If this is the correct usage, is this just an unfortunate example / out of domain?

Also, when using a CPU for inference, the torch.load call in rewarder.py needs an additional map_location parameter, as the saved weights default to CUDA:

 self.reward_model.load_state_dict(torch.load(weight_path, map_location=torch.device(device)))

Kind Regards, Michael

yg211 commented 5 years ago

Hi Michael,

Thanks very much for your interest in our project :D

As far as I can see from your example code, you are using it correctly. However, the absolute values are not meant to be interpreted on their own; you should use them to derive a ranking of the summaries. In your example, the model ranks the four summaries as summ3 > summ4 > summ2 > summ1.
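For instance, a minimal sketch of deriving that ranking from the raw scores (reusing doc, rewarder and summ1..summ4 from your snippet above):

    # Rank the candidate summaries by reward score: higher score = preferred summary.
    candidates = {'summ1': summ1, 'summ2': summ2, 'summ3': summ3, 'summ4': summ4}
    scores = {name: rewarder(doc, text) for name, text in candidates.items()}
    ranking = sorted(scores, key=scores.get, reverse=True)
    print(ranking)  # e.g. ['summ3', 'summ4', 'summ2', 'summ1'] for the scores above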

A bit more explanation: when we train the model, we push it to produce the correct ranking over each pair of summaries. For example, suppose you have two summaries s1 and s2 for the same document, and the human ratings say s1 is better than s2; during training, we then push the model to give s1 a higher score than s2. We also tried pushing the model to reproduce the human ratings directly (i.e. a regression loss), but that yields worse performance (see the paper for details).
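To illustrate the idea (just a sketch, not our exact training code), a pairwise margin ranking loss in PyTorch looks like this:

    import torch
    import torch.nn as nn

    # Sketch of a pairwise ranking objective: for a (better, worse) summary pair
    # of the same document, push score(better) above score(worse).
    loss_fn = nn.MarginRankingLoss(margin=1.0)
    score_s1 = torch.tensor([0.8])  # model score for s1, the human-preferred summary
    score_s2 = torch.tensor([0.3])  # model score for s2
    target = torch.ones(1)          # +1 means the first argument should be ranked higher
    loss = loss_fn(score_s1, score_s2, target)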

As for using a CPU for inference, you may refer to the answer here: https://discuss.pytorch.org/t/loading-weights-for-cpu-model-while-trained-on-gpu/1032/2. Sorry about the inconvenience; we will add code to cover the CPU use case.
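In the meantime, something along these lines should make the loading device-agnostic (a sketch mirroring the line you quoted from rewarder.py; weight_path and reward_model are as defined there):

    import torch

    # Fall back to the CPU when no CUDA device is available, and map the
    # GPU-saved weights onto that device when loading.
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    state_dict = torch.load(weight_path, map_location=torch.device(device))
    reward_model.load_state_dict(state_dict)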

Best, Yang

egness commented 3 years ago

Hi Yang,

Thanks for the details. I just wanted to ask if there are any updates on the regression task, i.e. reproducing the human ratings. I have a use case where I'd need to rate a summary with a normalised value r ∈ [0, 1].

Best!