xlang-ai / UnifiedSKG

[EMNLP 2022] Unifying and multi-tasking structured knowledge grounding with language models
https://arxiv.org/abs/2201.05966
Apache License 2.0

How can I specify PLM folder #19

Closed puraminy closed 2 years ago

puraminy commented 2 years ago

I have a problem downloading or caching the PLM due to my connection and blocked websites.

I want to point the config to a folder with an already-downloaded PLM. How can I set it in the yaml? Thanks

puraminy commented 2 years ago

It seems the configuration file accepts a file path too:

[bert]
location = /home/pouramini/pret/t5-base
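
A quick way to sanity-check that such a local folder loads without network access (a sketch; the path is the one from the config above, and only the config and tokenizer are probed here):

# Sanity check: transformers' from_pretrained accepts a local directory,
# so a pre-downloaded model folder works fully offline.
from transformers import AutoConfig, AutoTokenizer

local_path = "/home/pouramini/pret/t5-base"  # folder with config.json, tokenizer files, weights
config = AutoConfig.from_pretrained(local_path)        # reads config.json from the folder
tokenizer = AutoTokenizer.from_pretrained(local_path)  # reads the tokenizer files from the folder
print(config.model_type)  # "t5"
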
puraminy commented 2 years ago

Now my problem is how to specify the number of training and test records; I don't want to use all of the training records.

Timothyxxx commented 2 years ago

Hi,

What do you mean by the number of training and test records?

Thanks

puraminy commented 2 years ago

My dataset has 700,000 training records, but I want to train the model on only 4,000 of them. How can I do that?

Maybe I should set the number of training steps instead; in any case, I tried a parameter and got this error:

ValueError: Some specified arguments are not used by the HfArgumentParser: ['--num_training_steps', '4000']
Timothyxxx commented 2 years ago

Sorry, I still don't understand 😂. Do you mean you want to control the number of training epochs/steps?

puraminy commented 2 years ago

Yes, to control them (to limit the number of training and test records, or the number of steps).

I've already hardcoded it in the dataset file. I would like to know if there is a parameter or a better solution for that:

    def _generate_examples(self, filepath):
        """Yields examples."""
        with open(filepath, encoding="utf-8") as f:
            tsv_reader = csv.DictReader(f, delimiter="\t")
            idx = 0
            for example in tsv_reader:
                idx += 1
                # Hard-coded caps: at most 4000 train examples and 1000 test/val examples.
                if "train" in filepath and idx > 4000:
                    break
                if ("test" in filepath or "val" in filepath) and idx > 1000:
                    break
                # ... the loader's original yield statement follows here, unchanged ...
Timothyxxx commented 2 years ago

If you want to control the number of examples in the train and dev sets, what you did is right. But if you want to train for fewer epochs or steps, you should change the hyper-parameter num_train_epochs or max_steps. For more questions, see the Hugging Face Trainer docs here.
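
For reference, a minimal sketch of how those flags behave (this uses the standard transformers HfArgumentParser and Seq2SeqTrainingArguments, not UnifiedSKG's wrapped arguments, so treat it as illustrative only):

# Sketch: limiting training length via standard transformers arguments.
from transformers import HfArgumentParser, Seq2SeqTrainingArguments

parser = HfArgumentParser(Seq2SeqTrainingArguments)

# Equivalent to launching with:  --output_dir output --max_steps 4000
(training_args,) = parser.parse_args_into_dataclasses(
    args=["--output_dir", "output", "--max_steps", "4000"]
)
print(training_args.max_steps)          # 4000; overrides num_train_epochs when > 0
print(training_args.num_train_epochs)   # 3.0 (default), ignored once max_steps is set

# An unrecognized flag such as --num_training_steps is what triggers:
# ValueError: Some specified arguments are not used by the HfArgumentParser: ...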

base-y commented 2 years ago

Just to add to that, I wrote a tiny piece of code in train.py to load only a specified number of train and validation examples. The code is the snippet between START and END; the surrounding lines are only there to show where exactly the snippet goes in train.py.

            print('task_args.bert.location:', task_args.bert.location)
            task_raw_datasets_split: datasets.DatasetDict = datasets.load_dataset(
                path=task_args.dataset.loader_path,
                cache_dir=task_args.dataset.data_store_path)

            ### >>>>>> START
            # Limit dataset size if relevant args are provided
            if task_args.dataset.max_train_samples:
                task_raw_datasets_split['train'] = datasets.arrow_dataset.Dataset(
                    task_raw_datasets_split['train']._data[:task_args.dataset.max_train_samples])
            if task_args.dataset.max_val_samples:
                task_raw_datasets_split['validation'] = datasets.arrow_dataset.Dataset(
                    task_raw_datasets_split['validation']._data[:task_args.dataset.max_val_samples])
                if 'test' in task_raw_datasets_split:
                    task_raw_datasets_split['test'] = datasets.arrow_dataset.Dataset(
                        task_raw_datasets_split['test']._data[:task_args.dataset.max_val_samples])
            ### <<<<<<< END

            task_seq2seq_dataset_split: tuple = utils.tool.get_constructor(task_args.seq2seq.constructor)(task_args).\
                to_seq2seq(task_raw_datasets_split, cache_root)

The above code assumes that the number of train and val examples is given in configure/META_TUNING/<task>.py under [dataset]. If specified, only that many examples are used for training/validation. Note that this code doesn't skip loading the entire dataset; it just slices the loaded data so that only the specified number is used for training/validation.

# EXAMPLE
[dataset]
loader_path = ./tasks/spider.py
data_store_path = ./data
use_cache = False
max_train_samples = 10
max_val_samples = 10
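
As an aside, a possibly simpler variant of the START/END snippet (a sketch, not part of the repo) is to use the public Dataset.select API instead of slicing the private _data attribute; it drops into the same spot in train.py and reads the same max_train_samples / max_val_samples options:

# Alternative sketch: limit split sizes with Dataset.select (public API).
if task_args.dataset.max_train_samples:
    n = min(task_args.dataset.max_train_samples, len(task_raw_datasets_split['train']))
    task_raw_datasets_split['train'] = task_raw_datasets_split['train'].select(range(n))
if task_args.dataset.max_val_samples:
    for split in ('validation', 'test'):
        if split in task_raw_datasets_split:
            n = min(task_args.dataset.max_val_samples, len(task_raw_datasets_split[split]))
            task_raw_datasets_split[split] = task_raw_datasets_split[split].select(range(n))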

Hope you find this useful!