yafuly / SyntacticGen


Dataset #1

Closed 1251480932 closed 6 months ago

1251480932 commented 7 months ago

Dear author, would it be possible to make your dataset publicly available? Also, could you provide some guidance on how to create a dataset if I want to process my own data?

yafuly commented 7 months ago

Hi,

The original data for the paper was removed due to a system transfer.

For guidance on dataset creation, you can consider the following steps:

  1. Parse the target-side texts using an off-the-shelf constituency parser, e.g., Berkeley Parser.
  2. Traverse the constituency parse tree in a layer-wise order to obtain the constituents and their corresponding infilling texts at each tree level.
  3. Construct the syntax contexts and the infilling texts from the results of step 2 (see the paper for details).
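
As a rough illustration of steps 2 and 3, here is a minimal Python sketch. It is not the code used for the paper, and it rests on several assumptions: the parser output is a Penn-Treebank-style bracketed string, unexpanded constituents are rendered as `<LABEL>` placeholders in the context, and each level is stored as a (context, infills) pair. The function names (`parse_bracketed`, `layerwise`) are hypothetical, and the exact input/output format expected by the training scripts should be taken from the paper and the toy data rather than from this sketch.

```python
from typing import List, Tuple, Union

# A node is either a word (str) or a (label, children) tuple.
Node = Union[str, Tuple[str, list]]


def parse_bracketed(text: str) -> Node:
    """Parse a bracketed parse string, e.g. "(S (NP (DT the) ...))"."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read() -> Node:
        nonlocal pos
        pos += 1  # consume "("
        if tokens[pos] == "(":
            # Assumption: an unlabeled outer bracket is treated as ROOT.
            label = "ROOT"
        else:
            label = tokens[pos]
            pos += 1
        children: List[Node] = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    return read()


def words(node: Node) -> List[str]:
    """Return the words covered by a node, left to right."""
    if isinstance(node, str):
        return [node]
    return [w for child in node[1] for w in words(child)]


def is_phrase(node: Node) -> bool:
    """True for constituents that can still be expanded (not POS tags or words)."""
    return not isinstance(node, str) and any(not isinstance(c, str) for c in node[1])


def layerwise(tree: Node) -> List[Tuple[str, List[Tuple[str, str]]]]:
    """Walk the tree level by level.

    For each level, return the context (words plus <LABEL> placeholders)
    and the text that fills each placeholder.
    """
    levels = []
    frontier: List[Node] = [tree]
    while any(is_phrase(n) for n in frontier):
        context = " ".join(
            f"<{n[0]}>" if is_phrase(n) else " ".join(words(n)) for n in frontier
        )
        infills = [(n[0], " ".join(words(n))) for n in frontier if is_phrase(n)]
        levels.append((context, infills))
        # Expand every phrase into its children; keep words and POS tags as-is.
        frontier = [c for n in frontier for c in (n[1] if is_phrase(n) else [n])]
    return levels


if __name__ == "__main__":
    example = "(S (NP (DT the) (NN cat)) (VP (VBD sat) (PP (IN on) (NP (DT the) (NN mat)))) (. .))"
    for context, infills in layerwise(parse_bracketed(example)):
        print(context, "=>", infills)
```

Each (context, infills) pair could then be serialized into a source/target pair for sequence-to-sequence training, with the source-side text prepended to the context. How exactly that is done here is described in the paper, not in this sketch.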

We provide a sample dataset in toy_data. Specifically, the "training" directory contains the source texts, target texts and parsing results (step 1), whereas the "training_triplets" directory contains the final sequence-to-sequence syntax-aware data for training (step 3).