yaolu / Multi-XScience

Multi-XScience: A Large-scale Dataset for Extreme Multi-document Summarization of Scientific Articles
MIT License

hiersumm #3

Closed: yclzju closed this issue 3 years ago

yclzju commented 3 years ago

Hi, regarding hiersumm: how did you prepare your data? Did you use their sentencepiece model, and how did you encode your text? I noticed they didn't release the data preparation code. Thanks in advance. Also, for evaluation, which do you think is better: replacing @cite_2 with @cite, or just removing @cite_2?

yaolu commented 3 years ago

I use their sentencepiece model for the encoding. You can mimic their format; it is not difficult to do. Take a look at https://github.com/google/sentencepiece for how to use it.

For evaluation, just replace @cite_2 with @cite.
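
For example, a minimal normalisation pass over the references before scoring could look like this (the helper name and regex are illustrative, not taken from the Multi-XScience code):

import re

def normalize_citations(text: str) -> str:
    # Collapse numbered markers such as @cite_2 or @cite_15 into the generic @cite token.
    return re.sub(r"@cite_\d+", "@cite", text)

print(normalize_citations("Prior work @cite_2 and @cite_15 studied this ."))
# -> Prior work @cite and @cite studied this .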

yclzju commented 3 years ago

Hi, I looked into sentencepiece and used the sentencepiece model released with hiersumm. But when I tried to reproduce their preprocessing on the WikiSum dataset, I found that converting 'tgt' back to 'tgt_str' with "spm.DecodeIds(example['tgt']).split(' ')" works, while converting 'tgt_str' to 'tgt' (with spm.encode(example['tgt_str']) or spm.EncodeAsIds(example['tgt_str'])) gives something quite different. I also find I can't round-trip with spm.DecodeIds(spm.EncodeAsIds(example['tgt_str'])).split(). I wonder how you preprocessed with sentencepiece, or whether I have to give up their sentencepiece model and try another tokenizer.

yaolu commented 3 years ago

Not sure if I understand your question. Here is my code snippet for the tgt_str to tgt transformation.

SOS = "<S>"
EOS = "</S>"
QOS = "<Q> "

tokenizer = load_tokenizer('spm9998_3.model')

def get_tgt(tgt_text_list: list, tokenizer):
    tgt_text_list = [' '.join(word_tokenize(tgt_text)) for tgt_text in tgt_text_list]
    tgt_str = SOS + " " + QOS.join(tgt_text_list) + EOS
    tgt = tokenizer.encode_as_ids(tgt_str)
    return tgt, tgt_str
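
A small usage sketch of the snippet above (the example sentences are made up):

tgt, tgt_str = get_tgt(
    ["We propose a new model .", "It outperforms strong baselines ."],
    tokenizer,
)
print(tgt_str)  # <S> We propose a new model .<Q> It outperforms strong baselines .</S>
print(tgt[:5])  # first few sentencepiece ids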
yclzju commented 3 years ago

Thanks so much for your reply. I tried encoding like this:

import sentencepiece

spm = sentencepiece.SentencePieceProcessor(model_file=vocab_path)  # vocab_path points to spm9998_3.model
word_padding_idx = spm.PieceToId('<PAD>')
symbols = {'BOS': spm.PieceToId('<S>'), 'EOS': spm.PieceToId('</S>'), 'PAD': word_padding_idx,
           'EOT': spm.PieceToId('<T>'), 'EOP': spm.PieceToId('<P>'), 'EOQ': spm.PieceToId('<Q>')}
print(spm.EncodeAsIds(example['tgt_str']))
print(spm.decode(spm.encode(example['tgt_str'])))  # I find this is quite different from tgt_str

I will try your code. By the way, what is load_tokenizer? I can't find it in sentencepiece. And word_tokenize is from nltk, right?

yaolu commented 3 years ago

Try this.

import sentencepiece as spm

def load_tokenizer(spm_model_path):
    tokenizer = spm.SentencePieceProcessor()
    tokenizer.load(spm_model_path)
    return tokenizer
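
One quick sanity check, not from the thread, is to confirm that the special symbols are actually in the spm vocabulary instead of mapping to the unknown id (pieces that map to unk will not round-trip):

tokenizer = load_tokenizer('spm9998_3.model')
for piece in ['<S>', '</S>', '<Q>', '<P>', '<T>', '<PAD>']:
    pid = tokenizer.piece_to_id(piece)
    print(piece, pid, pid == tokenizer.unk_id())  # True means the piece is unknown to the model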
yclzju commented 3 years ago

Hi, I tried your code on the wiki dataset from hiersumm and ran into the same problem I had with my own code. Can you decode tgt back to tgt_str with tokenizer.decode(tgt)? On the wiki dataset I can decode the tgt stored in the dataset back to tgt_str, but I can't round-trip by first encoding tgt_str and then decoding. Here is my result:

example['tgt_str']: "`` The Essential Marcia Hines '' is a compilation album released on 30 July 2007 by Australian singer Marcia Hines . </t> <t> It was released following Hines ' induction into the ARIA Hall of Fame on 18 July 2007 . </t> <t> The album contains five top 10 singles taken from the albums , Marcia Shines , Shining , Ladies and Gentlemen and Ooh Child ."

decode from example['tgt']: "<S> `` the essential marcia hines '' is a compilation album released on 30 july 2007 by australian singer marcia hines .<Q> it was released following hines ' induction into the aria hall of fame on 18 july 2007 .<Q> the album contains five top 10 singles taken from the albums , marcia shines , shining , ladies and gentlemen and ooh child .</S>"

first encode then decode: "<S> `` The ⁇ ssential Marcia ⁇ ines `` is a compilation album released on 30 ⁇ uly 2007 by ⁇ ustralian singer Marcia ⁇ ines .<Q> ⁇ t was released following ⁇ ines ' induction into the ⁇ ⁇ all of ⁇ ame on 18 ⁇ uly 2007 .<Q>The album contains five top 10 singles taken from the albums , Marcia ⁇ hines , ⁇ hining , ⁇ adies and ⁇ entlemen and ⁇ oh ⁇ hild .<Q>"
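
The ⁇ pieces above are what sentencepiece prints for ids that map to <unk>. Judging from the decoded tgt, which is lowercased, the stored targets were probably lowercased before encoding, so many uppercase characters simply are not in spm9998_3.model's vocabulary. A rough check, not part of the thread's code:

text = example['tgt_str']
for variant in (text, text.lower()):
    ids = tokenizer.encode_as_ids(variant)
    n_unk = sum(1 for i in ids if i == tokenizer.unk_id())
    print(n_unk, 'unknown pieces out of', len(ids))
# If the lowercased variant has (almost) no unknown pieces, lowercasing before
# encoding should remove most of the ⁇ in the first-encode-then-decode output.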

yaolu commented 3 years ago

Hi, I did not try running hiersumm on the WikiSum dataset. For the Multi-XScience project, the system's decoding output looks good.

yclzju commented 3 years ago

Hi, do you mean you can decode back from your encoded tokens with spm9998_3.model? My main problem is that encoding and then decoding gives something quite different from the initial string.

yaolu commented 3 years ago

Yes, I can confirm that. But I didn't try encoding and then decoding back; I decode the model's output into natural language, and it should be the same.

I preprocess and convert the Multi-XScience dataset into the *.pt format required by hiersumm using spm9998_3.model, then use the same commands as hiersumm for training and decoding. The output is normal.

If you want to work on Multi-XScience, I can provide you with the preprocessed *.pt files. If you want to work on hiersumm + WikiSum, maybe you should ask the hiersumm author about the WikiSum issue?
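
The exact *.pt layout is defined by hiersumm rather than spelled out here, but the preprocessing described above roughly amounts to encoding every example with spm9998_3.model and saving the resulting records with torch. A simplified sketch, where all field names other than 'tgt' and 'tgt_str' are assumptions:

import torch

def build_shard(examples, tokenizer, out_path):
    data = []
    for ex in examples:
        # 'target_sentences' and 'source_paragraphs' are placeholder field names
        tgt, tgt_str = get_tgt(ex['target_sentences'], tokenizer)
        src = [tokenizer.encode_as_ids(p) for p in ex['source_paragraphs']]
        data.append({'src': src, 'tgt': tgt, 'tgt_str': tgt_str})
    torch.save(data, out_path)  # one shard in the *.pt format hiersumm reads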

yaolu commented 3 years ago

Feel free to reopen.

shreyas1599 commented 3 years ago

Hi @yaolu, I'm facing a similar problem. I first used the spm9998_3.model provided by the hiersumm author, and in a lot of places the decoded output contained '⁇'; it wasn't recognising certain symbols. These are some of the symbols where the error occurred: {^, ≥ ≤ © & ≃ ⨁ θ ’ – “ ” = } — Δ ∞ → ∈ ± °}. I passed these as user_defined_symbols while running spm.train() on your dataset, and that fixed the decoding to an extent, but there are still several instances where it doesn't decode properly. Could you provide a link to the .pt files of your dataset that you used to run hiersumm? Thanks.
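
For reference, registering those characters as user-defined symbols when training a new sentencepiece model looks roughly like this (paths, vocab size, and the exact symbol list are placeholders; this mirrors what is described above rather than anything shipped with Multi-XScience):

import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input='multixscience_train.txt',   # placeholder: plain-text training corpus, one sentence per line
    model_prefix='spm_multixscience',  # placeholder output prefix
    vocab_size=32000,                  # placeholder size
    user_defined_symbols=['<S>', '</S>', '<Q>', '<P>', '<T>', '<PAD>',
                          '≥', '≤', '©', '≃', 'θ', 'Δ', '∞', '→', '∈', '±', '°'],
)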