Closed: yclzju closed this issue 3 years ago
I use their sentencepiece model for the encoding. You can mimic their format; it is not difficult to do. Take a look at https://github.com/google/sentencepiece for how to use it.
For evaluation, just replace @cite_2 with @cite.
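That replacement can be done with a one-line regex; this is just a sketch, and the `@cite_<n>` marker pattern is assumed from the thread:

```python
import re

def normalize_cites(text):
    # Collapse indexed citation markers such as @cite_2 or @cite_13
    # into the plain @cite token used for evaluation.
    return re.sub(r"@cite_\d+", "@cite", text)

print(normalize_cites("as shown in @cite_2 and @cite_13"))  # as shown in @cite and @cite
```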
Hi, I looked at sentencepiece and used the sentencepiece model released with hiersumm. But when I tried to reproduce his preprocessing on the MultiWiki dataset, I found that converting 'tgt' back to 'tgt_str' with spm.DecodeIds(example['tgt']).split(' ') works, while converting 'tgt_str' to 'tgt' (with spm.encode(example['tgt_str']) or spm.EncodeAsIds(example['tgt_str'])) gives quite different ids. I also cannot round-trip: spm.DecodeIds(spm.EncodeAsIds(example['tgt_str'])).split() does not reproduce the original string. I wonder how you preprocessed with sentencepiece; otherwise I may have to give up on his sentencepiece model or try another tokenizer.
Not sure if I understand your question. Here is my code snippet for the tgt_str to tgt transformation.
SOS = "<S>"
EOS = "</S>"
QOS = "<Q> "
tokenizer = load_tokenizer('spm9998_3.model')

def get_tgt(tgt_text_list: list, tokenizer):
    tgt_text_list = [' '.join(word_tokenize(tgt_text)) for tgt_text in tgt_text_list]
    tgt_str = SOS + " " + QOS.join(tgt_text_list) + EOS
    tgt = tokenizer.encode_as_ids(tgt_str)
    return tgt, tgt_str
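To see concretely what string this builds, here is a self-contained sketch; the whitespace tokenizer and the dummy id encoder below are stand-ins (assumptions for illustration) for the real word_tokenize and the spm9998_3.model tokenizer:

```python
# Stand-in tokenizer: fakes encode_as_ids with one id per whitespace token.
class DummyTokenizer:
    def encode_as_ids(self, s):
        return list(range(len(s.split())))

# Stand-in for the real word tokenizer: plain whitespace split.
def word_tokenize(s):
    return s.split()

SOS, EOS, QOS = "<S>", "</S>", "<Q> "

def get_tgt(tgt_text_list, tokenizer):
    # Join word tokens with spaces, wrap in <S>...</S>, separate
    # sentences with <Q> -- the format the snippet above produces.
    tgt_text_list = [' '.join(word_tokenize(t)) for t in tgt_text_list]
    tgt_str = SOS + " " + QOS.join(tgt_text_list) + EOS
    return tokenizer.encode_as_ids(tgt_str), tgt_str

tgt, tgt_str = get_tgt(["first sentence .", "second sentence ."], DummyTokenizer())
print(tgt_str)  # <S> first sentence .<Q> second sentence .</S>
```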
Thanks so much for your reply. I tried to encode like this:
import sentencepiece

spm = sentencepiece.SentencePieceProcessor(model_file=vocab_path)
word_padding_idx = spm.PieceToId('<PAD>')
symbols = {'BOS': spm.PieceToId('<S>'), 'EOS': spm.PieceToId('</S>'), 'PAD': word_padding_idx,
           'EOT': spm.PieceToId('<T>'), 'EOP': spm.PieceToId('<P>'), 'EOQ': spm.PieceToId('<Q>')}
print(spm.EncodeAsIds(example['tgt_str']))
print(spm.decode(spm.encode(example['tgt_str'])))  # I find this is quite different from tgt_str
I will try your code. By the way, what is load_tokenizer? I can't find it in sentencepiece. Also, is word_tokenize from nltk?
Try this.
import sentencepiece as spm
def load_tokenizer(spm_model_path):
    tokenizer = spm.SentencePieceProcessor()
    tokenizer.load(spm_model_path)
    return tokenizer
Hi, I tried your code on the wiki dataset from hiersumm and hit the same problem as with my own code. Can you decode tgt back to tgt_str with tokenizer.decode(tgt)?
In the wiki dataset I can decode the provided tgt back to tgt_str, but I can't get back the original by first encoding tgt_str and then decoding.
Here is my result:
example['tgt_str']:
"`` The Essential Marcia Hines '' is a compilation album released on 30 July 2007 by Australian singer Marcia Hines . </t> <t> It was released following Hines ' induction into the ARIA Hall of Fame on 18 July 2007 . </t> <t> The album contains five top 10 singles taken from the albums , Marcia Shines , Shining , Ladies and Gentlemen and Ooh Child ."
decode from example['tgt']:
"<S> `` the essential marcia hines '' is a compilation album released on 30 july 2007 by australian singer marcia hines .<Q> it was released following hines ' induction into the aria hall of fame on 18 july 2007 .<Q> the album contains five top 10 singles taken from the albums , marcia shines , shining , ladies and gentlemen and ooh child .</S>"
first encode then decode:
"<S> `` The ⁇ ssential Marcia ⁇ ines `` is a compilation album released on 30 ⁇ uly 2007 by ⁇ ustralian singer Marcia ⁇ ines .<Q> ⁇ t was released following ⁇ ines ' induction into the ⁇ ⁇ all of ⁇ ame on 18 ⁇ uly 2007 .<Q>The album contains five top 10 singles taken from the albums , Marcia ⁇ hines , ⁇ hining , ⁇ adies and ⁇ entlemen and ⁇ oh ⁇ hild .<Q>"
Hi, I did not try to run hiersumm's wikisum dataset. In Multi-XScience project, the system decoding output looks good.
Hi, do you mean you can decode back from your encoded tokens with spm9998_3.model? My main problem is that encode-then-decode produces output quite different from the initial string.
Yes, I can confirm that. But I didn't try encode-then-decode; I only decode the model output to natural language. It should be the same, though.
I preprocessed and converted the Multi-XScience dataset into the *.pt format required by hiersumm using spm9998_3.model, then used the same commands as hiersumm for training and decoding. The output is normal.
If you want to work on Multi-XScience, I can provide you with the preprocessed *.pt files. If you want to work on hiersumm+wikisum, maybe you should ask the hiersumm author about the wikisum issue?
Feel free to reopen.
Hi @yaolu, I'm facing a similar problem. I first used the spm9998_3.model provided by the hiersumm author. In a lot of places the decoded output had '??'; it wasn't recognising the symbols. These are some of the symbols where the error occurred: {^, ≥ ≤ © & ≃ ⨁ θ ’ – “ ” = } — Δ ∞ → ∈ ± °}. I passed these as user_defined_symbols while running spm.train() on your dataset, and it then decoded properly to an extent, but there are still several instances where it doesn't decode correctly. Could you provide me with a link to the *.pt files of your dataset that you used to run hiersumm? Thanks.
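For reference, a retraining invocation along these lines passes the special markers to the trainer (this is a sketch: corpus.txt, the vocab size, and the symbol list are placeholders, not values from the thread):

```python
import sentencepiece as spm

# Training sketch. character_coverage=1.0 keeps rare characters such
# as "≥" or "Δ" in the vocabulary instead of mapping them to <unk>;
# user_defined_symbols reserves the structural markers as whole pieces.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="spm_custom",
    vocab_size=9998,
    character_coverage=1.0,
    user_defined_symbols=["<S>", "</S>", "<Q>", "<T>", "<P>", "<PAD>"],
)
```

Raising character_coverage may be a simpler fix than enumerating every odd symbol by hand, since the default (0.9995) drops the rarest characters from the vocabulary.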
Hi, as for hiersumm, how did you prepare your data? Did you use his sentencepiece model, and how did you encode your text? I noticed he didn't release the code for preparing the data. Thanks in advance. Also, for evaluation, which do you think is better: replacing @cite_2 with @cite, or just removing @cite_2?