for id in file_ids:
    if 'WSJ/00/WSJ_0000.MRG' <= id <= 'WSJ/24/WSJ_2499.MRG':
        train_file_ids.append(id)
    if 'WSJ/22/WSJ_2200.MRG' <= id <= 'WSJ/22/WSJ_2299.MRG':
        valid_file_ids.append(id)
    if 'WSJ/23/WSJ_2300.MRG' <= id <= 'WSJ/23/WSJ_2399.MRG':
        test_file_ids.append(id)
    # elif 'WSJ/00/WSJ_0000.MRG' <= id <= 'WSJ/01/WSJ_0199.MRG' or 'WSJ/24/WSJ_2400.MRG' <= id <= 'WSJ/24/WSJ_2499.MRG':
    #     rest_file_ids.append(id)
The distance values are extracted from models that were already trained on the PTB language modeling training set. No additional training is involved when performing unsupervised parsing in our setup.
On reviewing the following code, I find that the training data contains the test data. Is this correct?
https://github.com/yikangshen/Ordered-Neurons/blob/46d63cde024802eaf1eb7cc896431329014dd869/data_ptb.py#L25
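
For reference, here is a minimal sketch of how the overlap can be checked. It reuses the three range conditions from the linked code, but the `file_ids` list is synthetic (four made-up IDs per WSJ section, generated here only for illustration) and is not read from the actual corpus.

```python
# Synthetic file IDs for illustration only: four per section, sections 00-24.
# The real list would come from the corpus reader used in data_ptb.py.
file_ids = [
    'WSJ/%02d/WSJ_%02d%02d.MRG' % (section, section, n)
    for section in range(25)
    for n in range(0, 100, 25)
]

train_file_ids, valid_file_ids, test_file_ids = [], [], []
for file_id in file_ids:
    # Same three independent range checks as in the linked code.
    if 'WSJ/00/WSJ_0000.MRG' <= file_id <= 'WSJ/24/WSJ_2499.MRG':
        train_file_ids.append(file_id)
    if 'WSJ/22/WSJ_2200.MRG' <= file_id <= 'WSJ/22/WSJ_2299.MRG':
        valid_file_ids.append(file_id)
    if 'WSJ/23/WSJ_2300.MRG' <= file_id <= 'WSJ/23/WSJ_2399.MRG':
        test_file_ids.append(file_id)

# IDs that land in the training split and also in another split.
train_test_overlap = set(train_file_ids) & set(test_file_ids)
train_valid_overlap = set(train_file_ids) & set(valid_file_ids)

print('train/test overlap:', sorted(train_test_overlap))
print('train/valid overlap:', sorted(train_valid_overlap))
```

Because the training range spans sections 00 through 24 and the three conditions are separate `if` statements, every ID in sections 22 and 23 is appended to the training list as well as to the validation or test list.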