I have a problem downloading or caching the PLM because of my connection and blocked websites. I want to point to the folder of a PLM that was already downloaded. How can I set it in the YAML? Thanks.

It seems the configuration file accepts a local file path too:
[bert]
location = /home/pouramini/pret/t5-base
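In case it helps anyone with the same connection problem: the local folder can be created once on a machine with access and then copied over. A minimal sketch with the transformers API (the model name and target path here just mirror the config above):

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Download once where there is internet access...
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
tokenizer = AutoTokenizer.from_pretrained("t5-base")

# ...then save to the local directory that `location` points to.
model.save_pretrained("/home/pouramini/pret/t5-base")
tokenizer.save_pretrained("/home/pouramini/pret/t5-base")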
Now my problem is how to specify the number of training and test records; I don't want to use the whole training set.
Hi,
What do you mean by the number of training and test records?
Thanks
My dataset has 700,000 training records, but I want to train the model on 4000 of them. How can I do that?
Maybe I should set the number of training steps. Anyway, I tried a parameter and got this error:
ValueError: Some specified arguments are not used by the HfArgumentParser: ['--num_training_steps', '4000']
Sorry, I still don't understand 😂. Do you mean you want to control the training epochs/steps?
Yes, to control them (to put a limit on the number of training and test records, or on the steps).
I've already hardcoded it in the dataset file, but I would like to know if there is a parameter or a better solution for it:
def _generate_examples(self, filepath):
    """Yields examples, capped at 4000 train and 1000 test/val records."""
    with open(filepath, encoding="utf-8") as f:
        tsv_reader = csv.DictReader(f, delimiter="\t")
        idx = 0
        for example in tsv_reader:
            idx += 1
            # Hardcoded limits on the number of examples per split
            if "train" in filepath and idx > 4000:
                break
            if ("test" in filepath or "val" in filepath) and idx > 1000:
                break
            yield idx, example
If you want to control the number of examples in the train set and dev set, what you did is right. But if you want to train for fewer epochs or steps, you should change the hyper-parameter num_train_epochs or max_steps. For more questions, see the Hugging Face Trainer docs here.
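For example, with the transformers TrainingArguments (a minimal sketch; output_dir is just a placeholder). Since train.py parses these arguments with HfArgumentParser, the matching command-line flag is --max_steps, not --num_training_steps:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./outputs",   # placeholder
    # num_train_epochs=3,     # limit training by epochs, or:
    max_steps=4000,           # stop after 4000 optimizer steps (overrides num_train_epochs)
)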
Just to add to that, I wrote a tiny piece of code in train.py to load only a specified number of train and validation examples. The code is the snippet between START and END; the other lines are just to show where exactly the snippet is added in the train.py file.
print('task_args.bert.location:', task_args.bert.location)
task_raw_datasets_split: datasets.DatasetDict = datasets.load_dataset(
    path=task_args.dataset.loader_path,
    cache_dir=task_args.dataset.data_store_path)
### >>>>>> START
# Limit the dataset sizes if the corresponding args are provided
if task_args.dataset.max_train_samples:
    task_raw_datasets_split['train'] = task_raw_datasets_split['train'].select(
        range(task_args.dataset.max_train_samples))
if task_args.dataset.max_val_samples:
    task_raw_datasets_split['validation'] = task_raw_datasets_split['validation'].select(
        range(task_args.dataset.max_val_samples))
    if 'test' in task_raw_datasets_split:
        task_raw_datasets_split['test'] = task_raw_datasets_split['test'].select(
            range(task_args.dataset.max_val_samples))
### <<<<<<< END
task_seq2seq_dataset_split: tuple = utils.tool.get_constructor(task_args.seq2seq.constructor)(task_args).\
    to_seq2seq(task_raw_datasets_split, cache_root)
The above code assumes that the numbers of train and val examples are specified in configure/META_TUNING/<task>.cfg under [dataset]. If they are set, only that many examples are used for training/validation. Note that this code doesn't skip loading the entire dataset; it just slices the splits after loading, so only the specified number of examples reaches training/validation.
# EXAMPLE
[dataset]
loader_path = ./tasks/spider.py
data_store_path = ./data
use_cache = False
max_train_samples = 10
max_val_samples = 10
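As an aside, the datasets library also supports split slicing directly in load_dataset, which returns already-sliced splits (the full data is still generated and cached under the hood). A sketch, reusing the loader path and limits from the example config above; this is not something train.py does out of the box:

import datasets

# Take only the first 10 examples of each split via split-slicing syntax.
raw = datasets.DatasetDict({
    "train": datasets.load_dataset("./tasks/spider.py", split="train[:10]"),
    "validation": datasets.load_dataset("./tasks/spider.py", split="validation[:10]"),
})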
Hope you find this useful!