yaserkl / RLSeq2Seq

Deep Reinforcement Learning For Sequence to Sequence Models
https://arxiv.org/abs/1805.09461
MIT License
767 stars, 160 forks

Data pre-processing questions #13

Closed: twoflypig closed this issue 5 years ago

twoflypig commented 6 years ago

Hi, thanks for your awesome work! I ran into two problems in the data pre-processing stage.

  1. It seems that the code is missing an import statement in src/helper/cnn_dm_downloader.py. At https://github.com/yaserkl/RLSeq2Seq/blob/0095a768b4c2ab65babf87806e7c372d22cde3f0/src/helper/cnn_dm_downloader.py#L42, you use the Article class. However, it should be imported from the newspaper module.

  2. Error in input data? At https://github.com/yaserkl/RLSeq2Seq/blob/0095a768b4c2ab65babf87806e7c372d22cde3f0/src/helper/cnn_dm_downloader.py#L83, the code expects input files whose names end in html. However, in your src/helper/README.rst, section Download Raw Data, you said:

you need to do is to download the "*.story" files for each dataset

After downloading the data from the link, I found that the files end with .story, so the code will not process them.

Looking forward to your reply. Thank you!

yaserkl commented 5 years ago

Thanks for mentioning the missing import. Yes, you need the newspaper library for that class. Also, you can download the original html news articles from the original CNN/DM dataset and this code will take care of those files, too. However, if you only have .story files, simply set the mode option to anything but "article" and it will process everything from .story files.
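
For illustration, here is a minimal sketch of the two points above, not the repository's actual code. It shows the `Article` import that the issue reports as missing, guarded so the snippet still runs without the newspaper package installed, plus a hypothetical helper, `select_input_files`, that picks HTML files when the mode is "article" and `.story` files for any other mode. The helper name, its parameters, and the exact `.html` suffix are assumptions made for this example.

```python
from pathlib import Path

try:
    # The import reported as missing in the issue (from the newspaper package).
    from newspaper import Article
except ImportError:
    # Assumption for this sketch only: fall back to None so the file-selection
    # logic below can still be run without the package installed.
    Article = None


def select_input_files(data_dir, mode, html_suffix=".html", story_suffix=".story"):
    """Hypothetical helper: choose input files based on the mode option.

    mode == "article" -> raw HTML news articles
    any other mode    -> pre-extracted .story files
    """
    suffix = html_suffix if mode == "article" else story_suffix
    return sorted(
        path
        for path in Path(data_dir).iterdir()
        if path.is_file() and path.name.endswith(suffix)
    )
```

With a directory that holds only `.story` files, calling `select_input_files(data_dir, "article")` returns an empty list, which matches the symptom described in the question; any other mode value returns the `.story` files.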