xlang-ai / UnifiedSKG

[EMNLP 2022] Unifying and multi-tasking structured knowledge grounding with language models
https://arxiv.org/abs/2201.05966
Apache License 2.0

Questions about MultiWOZ and SMD (KVRET) #2

Closed ShaneTian closed 2 years ago

ShaneTian commented 2 years ago

Thank you for your awesome work!

I have two questions about structured knowledge processing on MultiWOZ and SMD (KVRET) datasets:

  1. For the MultiWOZ dataset, what is `ontology_values` for non-categorical slots (e.g. name, time)?

https://github.com/HKUNLP/UnifiedSKG/blob/65157f72d259c88d14603dd33ce747124e286f33/seq2seq_construction/multiwoz.py#L87-L88

  2. For the SMD (KVRET) dataset, the whole KB (without any explicit or hidden row selection) is fed in as the linearized structured knowledge, right?

Timothyxxx commented 2 years ago

Hello~ Thanks for your attention to this work!

Actually, we give detailed examples in sections F.13 (MultiWOZ 2.1) and F.14 (SMD) of the Appendix of our paper. You can check those, or check the code, to see exactly what we did.

In short, for non-categorical slots in MultiWOZ 2.1 we fill the place with a "none" value. For the SMD dataset, we implemented two versions: the original one, and the Mem2Seq-preprocessed one that is the de facto standard in the task-oriented dialogue (ToD) field. We simply linearized the corresponding KB (which can be formulated as a table, as triples, etc.; we chose the table formulation in our paper) and concatenated it with the dialogue history. The results for both versions are reported in the main table and in the Appendix tables. Our consideration of input length is also discussed in the Appendix.
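
For concreteness, here is a rough sketch of what that looks like (illustrative Python, not the repository's exact code; the slot names, separators, and KB rows below are assumptions):

```python
# Illustrative sketch only: the separators, slot names, and KB rows are
# assumptions for this example, not the repository's exact implementation.

def build_ontology_values(categorical_values, all_slots):
    """Give non-categorical slots (e.g. name, time) a placeholder "none"
    value so that every slot has an ontology entry."""
    return {slot: categorical_values.get(slot, ["none"]) for slot in all_slots}

def linearize_kb(kb_rows):
    """Flatten the whole KB (no row selection) into a table-like string."""
    if not kb_rows:
        return ""
    headers = list(kb_rows[0].keys())
    parts = ["col : " + " | ".join(headers)]
    for i, row in enumerate(kb_rows, start=1):
        parts.append(f"row {i} : " + " | ".join(str(row[h]) for h in headers))
    return " ".join(parts)

def build_seq2seq_input(dialogue_history, kb_rows):
    """Concatenate the dialogue history with the linearized KB."""
    return dialogue_history + " ; structured knowledge : " + linearize_kb(kb_rows)

print(build_ontology_values({"area": ["north", "south"]}, ["area", "name", "time"]))
print(build_seq2seq_input(
    "user : find me a gas station nearby",
    [{"poi": "valero", "distance": "4 miles", "traffic_info": "no traffic"}],
))
```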

Of course, better linearizations or other improvements can be studied in future work. Looking forward to your reply if you have any further questions!

ShaneTian commented 2 years ago

Thank you for your detailed reply!

ShaneTian commented 2 years ago

Sorry, there is one more question about KVRET. In the UnifiedSKG data processing:

[screenshot: kvret]

[screenshot: kvret_glmp]

Is that ↑ right?

Timothyxxx commented 2 years ago

Hi,

Thanks for asking; that's totally correct!

Actually, at first we wanted to use the official dataset and run experiments on it directly (since T5 is very robust, we thought it might not rely on preprocessing, and we could set up a new setting). However, we did not recognize the entities and underscore them in the structured input (that was our omission; we are not sure how much it affects the result). Moreover, that caused problems during evaluation with entity extraction, which we overcame with some post-processing. On the other hand, the GLMP preprocessing not only underscores the entities but also does other processing (like "'s" -> "_s", plus some fixes to the entity file), which makes it more reasonable. Therefore, at a relatively late stage we decided to also run the preprocessed version to better support our conclusions (and as you can see, the preprocessed version scores over 70 micro F1, which is better than the original version).
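
As a rough illustration of that kind of preprocessing (a sketch in the spirit of the Mem2Seq/GLMP-style normalization, not the actual script; the entity list and replacement rules are made up for the example):

```python
# Sketch of entity underscoring in the spirit of the Mem2Seq/GLMP-style
# preprocessing described above; the entity list and rules are assumptions.

def underscore_entities(text, entity_values):
    """Normalize "'s" to "_s" and replace multi-word entity values with
    underscored forms so entities stay intact as single tokens."""
    text = text.replace("'s", "_s")
    # Replace longer entities first so shorter substrings do not clobber them.
    for value in sorted(entity_values, key=len, reverse=True):
        text = text.replace(value, value.replace(" ", "_"))
    return text

entities = ["pizza hut", "5 miles", "moderate traffic"]
print(underscore_entities("pizza hut's parking lot is 5 miles away", entities))
# -> pizza_hut_s parking lot is 5_miles away
```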

Hope this information is helpful!

Thanks!

ShaneTian commented 2 years ago

Thanks for your reply!

So, in the first (official) version, you should have added underscores to both the structured input and the utterances, but in the paper's experiments you only added underscores to the utterances.
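
For example (made-up strings, just to illustrate what I mean):

```python
# Made-up strings, only to illustrate the mismatch: the utterance is
# underscored, while the linearized KB is not.
utterance = "the nearest gas station is valero , 4_miles away"
structured_kb = "col : poi | distance row 1 : valero | 4 miles"
```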

Timothyxxx commented 2 years ago

Yes, indeed. (Though judging from our observations on other datasets, it shouldn't affect the results much.)

ShaneTian commented 2 years ago

Thank you again ❤️