Closed: hitz02 closed this issue 3 years ago
TShirt dataset is not found locally, search at https://github.com/yangheng95/ABSADatasets Fail to remove the temp file C:\Users\chuan\AppData\Local\Temp\tmpuwy9h_4m Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias']
It works for me; this may be caused by a different encoding in your environment. You can try:
PYTHONIOENCODING=UTF8 python train_atepc.py
import sys
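(The snippet above appears truncated; a possible completion, assuming Python 3.7+ where sys.stdout.reconfigure is available, would be:)
import sys
# force UTF-8 on stdout/stderr in-script, mirroring PYTHONIOENCODING=UTF8
sys.stdout.reconfigure(encoding='utf-8')
sys.stderr.reconfigure(encoding='utf-8')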
Can you check what might be another possible reason?
Did you reclone (update) the datasets?
Sorry, can you explain what you mean by recloning the datasets?
I pip-installed the updated version of PyABSA, and my understanding is that the ATEPC trainer automatically downloads the dataset before training starts.
Let me know if there is any other approach that I missed.
To reclone the datasets, you can delete the downloaded datasets; the code should then download them again.
I updated the TShirt ATEPC dataset a few days ago, but an already-downloaded copy is not re-downloaded automatically.
https://github.com/yangheng95/ABSADatasets/tree/master/datasets/atepc_datasets/TShirt
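For example, forcing a re-download can be as simple as removing the local copy before re-running the trainer (a sketch; the download folder name varies by version, appearing as 'datasets' or 'integrated_datasets' in the logs in this thread):
import shutil
# delete the previously downloaded datasets so the trainer fetches them again
shutil.rmtree('integrated_datasets', ignore_errors=True)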
Same error with the updated dataset as well. Can you check using Google Colab? I am running ATEPC English training using a checkpoint example.
I am sorry about that; I will try it on Colab. However, does this error occur with every dataset?
Checked with the Twitter dataset. Same IndexError.
Twitter dataset is not found locally, search at https://github.com/yangheng95/ABSADatasets Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias']
IndexError Traceback (most recent call last)
7 frames
/usr/local/lib/python3.7/dist-packages/pyabsa/functional/trainer/trainer.py in __init__(self, config, dataset, from_checkpoint, checkpoint_save_mode, auto_device)
     92             config.model_path_to_save = None
     93
---> 94         self.train()
     95
     96     def train(self):

/usr/local/lib/python3.7/dist-packages/pyabsa/functional/trainer/trainer.py in train(self)
    103             self.config.seed = s
    104             if self.checkpoint_save_mode:
--> 105                 model_path.append(self.train_func(self.config, self.from_checkpoint, self.logger))
    106             else:
    107                 # always return the last trained model if dont save trained model

/usr/local/lib/python3.7/dist-packages/pyabsa/core/atepc/training/atepc_trainer.py in train4atepc(opt, from_checkpoint_path, logger)
    352     while not trainer:
    353         try:
--> 354             trainer = Instructor(opt, logger)
    355             if from_checkpoint_path:
    356                 model_path = find_files(from_checkpoint_path, '.model')

/usr/local/lib/python3.7/dist-packages/pyabsa/core/atepc/training/atepc_trainer.py in __init__(self, opt, logger)
     70             len(self.train_examples) / self.opt.batch_size / self.opt.gradient_accumulation_steps) * self.opt.num_epoch
     71         train_features = convert_examples_to_features(self.train_examples, self.label_list, self.opt.max_seq_len,
---> 72                                                       self.tokenizer, self.opt)
     73         all_spc_input_ids = torch.tensor([f.input_ids_spc for f in train_features], dtype=torch.long)
     74         all_input_mask = torch.tensor([f.input_mask for f in train_features], dtype=torch.long)

/usr/local/lib/python3.7/dist-packages/pyabsa/core/atepc/dataset_utils/data_utils_for_training.py in convert_examples_to_features(examples, label_list, max_seq_len, tokenizer, opt)
    188             text_right = ''
    189             aspect = ''
--> 190             prepared_inputs = prepare_input_for_atepc(opt, tokenizer, text_left, text_right, aspect)
    191             lcf_cdm_vec = prepared_inputs['lcf_cdm_vec']
    192             lcf_cdw_vec = prepared_inputs['lcf_cdw_vec']

/usr/local/lib/python3.7/dist-packages/pyabsa/core/atepc/dataset_utils/atepc_utils.py in prepare_input_for_atepc(opt, tokenizer, text_left, text_right, aspect)
     60
     61     if 'lcfs' in opt.model_name or opt.use_syntax_based_SRD:
---> 62         syntactical_dist, _ = get_syntax_distance(text_raw, aspect, tokenizer, opt)
     63     else:
     64         syntactical_dist = None

/usr/local/lib/python3.7/dist-packages/pyabsa/core/apc/dataset_utils/apc_utils.py in get_syntax_distance(text_raw, aspect, tokenizer, opt)
    240     # the following two functions are both designed to calculate syntax-based distances
    241     if opt.srd_alignment:
--> 242         syntactical_dist = syntax_distance_alignment(raw_tokens, dist, opt.max_seq_len, tokenizer)
    243     else:
    244         syntactical_dist = pad_syntax_based_srd(raw_tokens, dist, tokenizer, opt)[1]

/usr/local/lib/python3.7/dist-packages/pyabsa/core/apc/dataset_utils/apc_utils.py in syntax_distance_alignment(tokens, dist, max_seq_len, tokenizer)
     38     if bert_tokens != text:
     39         while text or bert_tokens:
---> 40             if text[0] == ' ' or text[0] == '\xa0':  # bad case handle
     41                 text = text[1:]
     42                 dep_dist = dep_dist[1:]
IndexError: list index out of range
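(For context: the loop shown in the last frame only requires that either list be non-empty before indexing text[0], so once text is exhausted it raises. A defensive sketch of that bad-case handling, as an illustration only and not PyABSA's actual code:)
def strip_leading_spaces(text, dep_dist):
    # illustration: guard the empty-list case before indexing text[0];
    # the original condition `while text or bert_tokens` lets `text`
    # become empty and then raises IndexError on text[0]
    while text and (text[0] == ' ' or text[0] == '\xa0'):  # bad case handle
        text = text[1:]
        dep_dist = dep_dist[1:]
    return text, dep_dist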
Hello, you can set "atepc_config_english.srd_alignment = False" to disable the srd_alignment function. This does not have a significant impact on the performance of the model. The SRD in the LCFS-series models is partly calculated by the spaCy tool, while the text tokens are split by the BERT tokenizer. Since there are some differences between the BERT tokenizer and the spaCy tokenizer, the srd_alignment function is designed to align the SRD with the text tokens. We will fix the srd_alignment function later. Thank you for your feedback.
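In context, the workaround looks like this (a sketch using the config entry point shown elsewhere in this thread):
atepc_config_english = ATEPCConfigManager.get_atepc_config_english()
# skip the BERT/spaCy token alignment step that triggers the IndexError;
# per the maintainer, this has no significant impact on model performance
atepc_config_english.srd_alignment = False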
The problem has been solved in the new version of PyABSA.
Thanks for resolving the above issue. I am currently trying out training using checkpoints by following this link - https://github.com/yangheng95/PyABSA/blob/release/examples/aspect_term_extraction/train_atepc_based_on_checkpoint.py
TShirt dataset is not found locally, search at https://github.com/yangheng95/ABSADatasets
Invalid device string: '0'
Downloading: 100% 28.0/28.0 [00:00<00:00, 621B/s]
Downloading: 100% 570/570 [00:00<00:00, 13.0kB/s]
Downloading: 100% 232k/232k [00:00<00:00, 644kB/s]
Downloading: 100% 466k/466k [00:00<00:00, 622kB/s]
Downloading: 100% 440M/440M [00:13<00:00, 33.5MB/s]
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias']
Thank you for your careful report; we are working on releasing a stable version, but many situations are not easy to anticipate, e.g., whether CUDA is available or not. I reproduced the error and will fix it immediately.
FYI, I assume you can manually avoid the error by setting auto_device='cuda' and choosing another checkpoint, because the checkpoint you are loading was mis-uploaded.
Please try v1.1; it should fix the CUDA detection. I will upload new checkpoints soon.
This issue should be all set in v1.1; please test it and report any problem. Thank you very much!
TShirt dataset is not found locally, search at https://github.com/yangheng95/ABSADatasets Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias']
Thanks for your thorough reports across many situations; they really help us. It is fixed in patch v1.1.1. Have fun!
Hi, I am not sure if this is an error, but the model got trained once and then started again with a different batch size.
TShirt dataset is not found locally, search at https://github.com/yangheng95/ABSADatasets
Downloading: 100% 28.0/28.0 [00:00<00:00, 751B/s]
Downloading: 100% 570/570 [00:00<00:00, 19.3kB/s]
Downloading: 100% 232k/232k [00:00<00:00, 875kB/s]
Downloading: 100% 466k/466k [00:00<00:00, 846kB/s]
Downloading: 100% 440M/440M [00:09<00:00, 51.8MB/s]
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight']
This is the aspect extractor aims to extract aspect and predict sentiment, note that use_bert_spc is disabled while extracting aspects and classifying sentiment!
Load aspect extractor from training
Config used in Training:
model: <class 'pyabsa.core.atepc.models.lcfs_atepc.LCFS_ATEPC'> --> Active
optimizer: adamw --> Active
learning_rate: 2e-05 --> Active
pretrained_bert: bert-base-uncased --> Active
use_bert_spc: False --> Active
max_seq_len: 80 --> Active
SRD: 3 --> Active
lcf: cdw --> Active
dropout: 0.5 --> Active
l2reg: 5e-05 --> Active
num_epoch: 5 --> Active
batch_size: 16 --> Active
seed: 1 --> Active
embed_dim: 768 --> Active
hidden_dim: 768 --> Active
polarities_dim: 3 --> Active
log_step: 50 --> Active
gradient_accumulation_steps: 1 --> Active
dynamic_truncate: True --> Active
srd_alignment: True --> Active
evaluate_begin: 0 --> Active
dataset_file: {'train': ['/content/datasets/atepc_datasets/TShirt/Menstshirt_Train.xml.seg.atepc'], 'test': ['/content/datasets/atepc_datasets/TShirt/Menstshirt_Test_Gold.xml.seg.atepc']} --> Active
device: cuda --> Active
device_name: Tesla T4 --> Active
model_name: lcfs_atepc --> Active
Version: 1.1.1 --> Active
dataset_path: TShirt --> Active
save_mode: 0 --> Active
model_path_to_save: None --> Active
num_labels: 6 --> Active
max_test_metrics: {'max_apc_test_acc': 0, 'max_apc_test_f1': 0, 'max_ate_test_f1': 0} --> Active
metrics_of_this_checkpoint: {'apc_acc': 88.09, 'apc_f1': 71.44, 'ate_f1': 78.88} --> Active
use_syntax_based_SRD: False --> Default
window: lr --> Default
initializer: xavier_uniform_ --> Default
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Using bos_token, but it is not set yet.
Using eos_token, but it is not set yet.
100%|██████████| 1870/1870 [00:17<00:00, 109.88it/s, convert examples to features]
100%|██████████| 470/470 [00:04<00:00, 109.56it/s, convert examples to features]
2021-08-19 12:17:06,491 INFO: >>> model: <class 'pyabsa.core.atepc.models.lcfs_atepc.LCFS_ATEPC'> --> Active
2021-08-19 12:17:06,492 INFO: >>> optimizer: adamw --> Active
2021-08-19 12:17:06,497 INFO: >>> learning_rate: 2e-05 --> Active
2021-08-19 12:17:06,498 INFO: >>> pretrained_bert: bert-base-uncased --> Active
2021-08-19 12:17:06,500 INFO: >>> use_bert_spc: False --> Active
2021-08-19 12:17:06,502 INFO: >>> max_seq_len: 80 --> Active
2021-08-19 12:17:06,504 INFO: >>> SRD: 3 --> Active
2021-08-19 12:17:06,505 INFO: >>> lcf: cdw --> Active
2021-08-19 12:17:06,508 INFO: >>> dropout: 0.5 --> Active
2021-08-19 12:17:06,509 INFO: >>> l2reg: 5e-05 --> Active
2021-08-19 12:17:06,511 INFO: >>> num_epoch: 5 --> Active
2021-08-19 12:17:06,514 INFO: >>> batch_size: 1 --> Active
2021-08-19 12:17:06,515 INFO: >>> seed: 2 --> Active
2021-08-19 12:17:06,517 INFO: >>> embed_dim: 768 --> Active
2021-08-19 12:17:06,519 INFO: >>> hidden_dim: 768 --> Active
2021-08-19 12:17:06,522 INFO: >>> polarities_dim: 3 --> Active
2021-08-19 12:17:06,524 INFO: >>> log_step: 50 --> Active
2021-08-19 12:17:06,527 INFO: >>> gradient_accumulation_steps: 1 --> Active
2021-08-19 12:17:06,529 INFO: >>> dynamic_truncate: True --> Active
2021-08-19 12:17:06,531 INFO: >>> srd_alignment: True --> Active
2021-08-19 12:17:06,534 INFO: >>> evaluate_begin: 0 --> Active
2021-08-19 12:17:06,536 INFO: >>> dataset_file: {'train': ['/content/datasets/atepc_datasets/TShirt/Menstshirt_Train.xml.seg.atepc'], 'test': ['/content/datasets/atepc_datasets/TShirt/Menstshirt_Test_Gold.xml.seg.atepc']} --> Active
2021-08-19 12:17:06,537 INFO: >>> device: cuda --> Active
2021-08-19 12:17:06,539 INFO: >>> device_name: Tesla T4 --> Active
2021-08-19 12:17:06,541 INFO: >>> model_name: lcfs_atepc --> Active
2021-08-19 12:17:06,544 INFO: >>> Version: 1.1.1 --> Active
2021-08-19 12:17:06,545 INFO: >>> dataset_path: TShirt --> Active
2021-08-19 12:17:06,548 INFO: >>> save_mode: 0 --> Active
2021-08-19 12:17:06,555 INFO: >>> model_path_to_save: None --> Active
2021-08-19 12:17:06,556 INFO: >>> num_labels: 6 --> Active
2021-08-19 12:17:06,557 INFO: >>> max_test_metrics: {'max_apc_test_acc': 0, 'max_apc_test_f1': 0, 'max_ate_test_f1': 0} --> Active
2021-08-19 12:17:06,558 INFO: >>> metrics_of_this_checkpoint: {'apc_acc': 88.09, 'apc_f1': 71.44, 'ate_f1': 78.88} --> Active
2021-08-19 12:17:06,559 INFO: >>> use_syntax_based_SRD: False --> Default
2021-08-19 12:17:06,560 INFO: >>> window: lr --> Default
2021-08-19 12:17:06,561 INFO: >>> initializer: xavier_uniform_ --> Default
2021-08-19 12:17:10,741 INFO: Running training for Aspect Term Extraction
2021-08-19 12:17:10,743 INFO: Num examples = 1870
2021-08-19 12:17:10,748 INFO: Batch size = 1
2021-08-19 12:17:10,750 INFO: Num steps = 9350
Nope, the final checkpoint is loaded and the training args are printed. But if it does perform another training process, do feel free to contact me.
Hi, as I said, it continues to perform another training run with a different batch size. Check the log shared above. Ideally, it should end with a training summary.
The default experiment config takes 1, 2, 3 as the seeds, so it will train using all seeds (3 times indeed), but I am not sure why the batch size changed; I am locating the error.
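For example, to run a single training pass (a sketch; the same option is shown again later in this thread):
config = ATEPCConfigManager.get_atepc_config_english()
# the default seed is the list [1, 2, 3], which trains three times;
# a single int trains once
config.seed = 1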
config = ATEPCConfigManager.get_atepc_config_english()
checkpoint_path = ATEPCCheckpointManager.get_checkpoint(checkpoint='english')
config.model = ATEPCModelList.LCFS_ATEPC
config.evaluate_begin = 4
config.num_epoch = 5
TShirt = ABSADatasetList.TShirt
aspect_extractor = Trainer(config=config,
                           dataset=TShirt,
                           from_checkpoint=checkpoint_path,
                           checkpoint_save_mode=1,
                           auto_device=True
                           )
Can you show me your training code? E.g., a script based on any example.
Here is my code -
config = ATEPCConfigManager.get_atepc_config_english()
checkpoint_path = ATEPCCheckpointManager.get_checkpoint(checkpoint='english')
config.model = ATEPCModelList.LCFS_ATEPC
config.evaluate_begin = 0
config.num_epoch = 5
tshirt = ABSADatasetList.TShirt
aspect_extractor = Trainer(config=config,
dataset=tshirt,
from_checkpoint=checkpoint_path,
checkpoint_save_mode=0,
auto_device='cuda'
)
I have a few follow up questions -
The error has been located and is being repaired. I released fix patch 1.1.2 to resolve the batch-size-changing problem in multiple training runs. But if you just need to train once, simply set a single seed.
For your questions:
Hi,
I tried running the trainer by giving checkpoint path and data set path -
config = ATEPCConfigManager.get_atepc_config_english()
config.model = ATEPCModelList.LCFS_ATEPC
config.evaluate_begin = 0
config.num_epoch = 2
aspect_extractor = Trainer(config=config,
dataset=</path/to/tshirt_dataset directory>,
from_checkpoint=<path/to/english_ATEPC_model directory>,
checkpoint_save_mode=0,
auto_device='cuda'
)
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.seq_relationship.bias']
warning! detected error line(s) in input file:<!DOCTYPE html>
Seems to be ConnectionError, retry in 60 seconds...
I have already downloaded the dataset and model checkpoints and given their paths to the trainer. What else does it expect to download? My organization has restricted network access, and hence the trainer code fails.
Let me know if I have missed anything here.
I will review the code. Did you set the dataset path of the ATEPC datasets?
Yes, I downloaded the T-shirt dataset to my local machine and gave its path (the folder with the train and test ATEPC files).
Hi, your code example is for performing inference, and I am trying out training on the TShirt dataset using the English checkpoint. The paths given are -
data_path = '/data/hitz02/tshirt_datasets' #this path contains train and test files
checkpoint_path = '/data/hitz02/lcf_atepc_cdw_apcacc_82.77_apcf1_72.41_atef1_80.97/' #this path contains the state dict and other 3 files
atepc_config_english = ATEPCConfigManager.get_atepc_config_english()
atepc_config_english.num_epoch = 10
atepc_config_english.evaluate_begin = 4
atepc_config_english.log_step = 100
atepc_config_english.model = ATEPCModelList.LCF_ATEPC
# SemEval = ABSADatasetList.SemEval
dataset_path = r'integrated_datasets/datasets/atepc_datasets/TShirt'
aspect_extractor = ATEPCTrainer(config=atepc_config_english,
dataset=dataset_path,
from_checkpoint='ATEPC_ENGLISH_TRAINED_MODEL',
checkpoint_save_mode=1,
auto_device=True
)
Sorry, I made a mistake; this training script works, too. Can you try this?
The ATEPCTrainer parameter ->
from_checkpoint='ATEPC_ENGLISH_TRAINED_MODEL'
will require the checkpoint to be downloaded from Google Drive, right?
The same checkpoint I have already downloaded and placed on my local. I wish to give the local path of this downloaded checkpoint and then train on tshirt dataset. Can this be done?
ATEPC_ENGLISH_TRAINED_MODEL is a local path in my environment; I already downloaded the checkpoint before, so
from_checkpoint='ATEPC_ENGLISH_TRAINED_MODEL'
can be replaced by
from_checkpoint='/data/hitz02/lcf_atepc_cdw_apcacc_82.77_apcf1_72.41_atef1_80.97/'
but I recommend using from_checkpoint='lcf_atepc_cdw_apcacc_82.77_apcf1_72.41_atef1_80.97'
as the parameter; the code will search for the checkpoint in the current working path.
Tried from_checkpoint=lcf_atepc_cdw_apcacc_82.77_apcf1_72.41_atef1_80.97
Same error -
warning! detected error line(s) in input file:<!DOCTYPE html>
Seems to be ConnectionError, retry in 60 seconds...
I think it tries to download something else (apart from data and checkpoint) and fails due to restriction.
I see; this error is probably caused by illegal input, but I can't reproduce your problem.
I am puzzled by the input <!DOCTYPE html>.
Can you show me a brief view of your dataset?
I am not using any custom dataset.
I just downloaded the TShirt dataset from your repo, placed it on a local path, and used that path in the trainer code
data_path = '/data/hitz02/tshirt_datasets'
The path contains 2 files -
Menstshirt_Train.xml.seg.atepc
Menstshirt_Test_Gold.xml.seg.atepc
It's a weird error, indeed. According to my test,
dataset_path = r'integrated_datasets/datasets/atepc_datasets/TShirt'
aspect_extractor = ATEPCTrainer(config=atepc_config_english,
dataset=dataset_path,
from_checkpoint='lcf_atepc_cdw_apcacc_80.7_apcf1_68.67_atef1_80.55',
checkpoint_save_mode=1,
auto_device=True
)
may be functionally similar to your script, and it works. I suppose the dataset is not being located correctly.
Yes, thanks. The problem was with the dataset that I downloaded. I deleted it and downloaded it again, and it worked.
How can I change the config to run only with single seed?
e.g.,
atepc_config_english.seed = 1
Hi,
This is regarding saving the model.
When I keep checkpoint_save_mode=1, it creates a lot of intermediate state-dict files, GBs in size.
To avoid that, I set checkpoint_save_mode=0.
Now the model is not saved but stays as a trainer object in memory.
How can I save it to disk using the trainer object?
Is there a save-model method implemented for it? Can we pickle it?
I recommend setting checkpoint_save_mode=1 with evaluate_begin=num_epoch-1, and you can set a large evaluate-per-steps threshold, e.g., log_step=100. This will reduce the intermediate files.
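Concretely, the recommendation would look like this (a sketch reusing the config object from the earlier snippets):
config = ATEPCConfigManager.get_atepc_config_english()
config.num_epoch = 6
# evaluate (and therefore save checkpoints) only in the final epoch,
# and evaluate less often within it, to limit intermediate files
config.evaluate_begin = config.num_epoch - 1
config.log_step = 100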
The original problem has been solved, so I will close this issue. Please feel free to report new errors or raise enquiries.
Got a KeyError when I tried to save a checkpoint:
checkpoint_path = 'lcf_atepc_cdw_apcacc_82.77_apcf1_72.41_atef1_80.97'
data_path = '/data/hitz02/ABSADatasets/datasets/atepc_datasets/TShirt'
config.model = ATEPCModelList.LCFS_ATEPC
config.evaluate_begin = 5
config.num_epoch = 6
config.log_step = 100
config.seed = 2
aspect_extractor = ATEPCTrainer(config=config,
dataset=data_path,
from_checkpoint=checkpoint_path,
checkpoint_save_mode=1,
auto_device='cuda'
)
KeyError: 'Fail to load the model from /data/hitz02/checkpoints/lcfs_atepc_cdw_apcacc_89.57_apcf1_72.5_atef1_78.53/! the checkpoint is broken, or maybe the checkpoint is not compatible with this version.'
Hi, what version are you using? Did you update to 1.1.7?
I was using 1.1.5; I updated to 1.1.7 and tried again. The same KeyError was encountered.
Traceback -
This is the aspect extractor aims to extract aspect and predict sentiment, note that use_bert_spc is disabled while extracting aspects and classifying sentiment!
Load aspect extractor from /data/hitz02/checkpoints/lcfs_atepc_cdw_apcacc_89.57_apcf1_72.5_atef1_78.53/
TypeError Traceback (most recent call last)
/data/hitz02/env/lib/python3.6/site-packages/pyabsa/core/atepc/prediction/aspect_extractor.py in __init__(self, model_arg, sentiment_map)
58 config_path = find_file(model_arg, '.config', exclude_key=['__MACOSX'])
---> 59
60 print('config: {}'.format(config_path))
TypeError: expected str, bytes or os.PathLike object, not NoneType
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
<ipython-input-59-d4b7ed1bdcf7> in <module>
3 from_checkpoint=checkpoint_path,
4 checkpoint_save_mode=1,
----> 5 auto_device='cuda'
6 )
/data/hitz02/env/lib/python3.6/site-packages/pyabsa/functional/trainer/trainer.py in __init__(self, config, dataset, from_checkpoint, checkpoint_save_mode, auto_device)
107 self.config.seed = [self.config.seed]
108
--> 109 model_path = []
110 seeds = self.config.seed
111 for _, s in enumerate(seeds):
/data/hitz02/env/lib/python3.6/site-packages/pyabsa/functional/trainer/trainer.py in train(self)
126
127
--> 128 class APCTrainer(Trainer):
129 pass
130
/data/hitz02/env/lib/python3.6/site-packages/pyabsa/core/atepc/prediction/aspect_extractor.py in __init__(self, model_arg, sentiment_map)
78 else:
79 self.tokenizer = BertTokenizer.from_pretrained(self.opt.pretrained_bert, do_lower_case=True)
---> 80
81 self.tokenizer.bos_token = self.tokenizer.bos_token if self.tokenizer.bos_token else '[CLS]'
82 self.tokenizer.eos_token = self.tokenizer.eos_token if self.tokenizer.eos_token else '[SEP]'
KeyError: 'Fail to load the model from /data/hitz02/checkpoints/lcfs_atepc_cdw_apcacc_89.57_apcf1_72.5_atef1_78.53/! the checkpoint is broken, or maybe the checkpoint is not compatible with this version.'
I see, does the path contain .config and .state_dict files?
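A quick way to check (a sketch; the path is the one from the traceback above):
import os
ckpt_dir = '/data/hitz02/checkpoints/lcfs_atepc_cdw_apcacc_89.57_apcf1_72.5_atef1_78.53/'
# the loader searches this folder for files ending in .config and .state_dict
print(os.listdir(ckpt_dir))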
I realized the model it is trying to load was never created in the checkpoints directory in the first place. I have two checkpoints with different accuracy values created, but not the one we see in the traceback. Also, both of the created checkpoints do have config and state-dict files.
Another weird error; I tested similar code without errors. Can you share the checkpoint with me via Google Drive or something, so I can debug?
As I said, it is trying to load a checkpoint which was never created. Check the logs and errors below -
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Using bos_token, but it is not set yet.
Using eos_token, but it is not set yet.
100%|██████████| 1870/1870 [00:20<00:00, 93.24it/s, convert examples to features]
100%|██████████| 470/470 [00:04<00:00, 94.46it/s, convert examples to features]
2021-08-24 16:55:16,833 INFO: >>> model: <class 'pyabsa.core.atepc.models.lcfs_atepc.LCFS_ATEPC'> --> Active
2021-08-24 16:55:16,835 INFO: >>> optimizer: adamw --> Active
2021-08-24 16:55:16,836 INFO: >>> learning_rate: 2e-05 --> Active
2021-08-24 16:55:16,838 INFO: >>> pretrained_bert: bert-base-uncased --> Active
2021-08-24 16:55:16,840 INFO: >>> use_bert_spc: False --> Active
2021-08-24 16:55:16,841 INFO: >>> max_seq_len: 80 --> Active
2021-08-24 16:55:16,843 INFO: >>> SRD: 3 --> Active
2021-08-24 16:55:16,844 INFO: >>> lcf: cdw --> Active
2021-08-24 16:55:16,845 INFO: >>> dropout: 0.5 --> Active
2021-08-24 16:55:16,846 INFO: >>> l2reg: 5e-05 --> Active
2021-08-24 16:55:16,847 INFO: >>> num_epoch: 6 --> Active
2021-08-24 16:55:16,848 INFO: >>> batch_size: 16 --> Active
2021-08-24 16:55:16,849 INFO: >>> seed: 2 --> Active
2021-08-24 16:55:16,850 INFO: >>> embed_dim: 768 --> Active
2021-08-24 16:55:16,851 INFO: >>> hidden_dim: 768 --> Active
2021-08-24 16:55:16,852 INFO: >>> polarities_dim: 3 --> Active
2021-08-24 16:55:16,853 INFO: >>> log_step: 100 --> Active
2021-08-24 16:55:16,858 INFO: >>> gradient_accumulation_steps: 1 --> Active
2021-08-24 16:55:16,859 INFO: >>> dynamic_truncate: True --> Active
2021-08-24 16:55:16,860 INFO: >>> srd_alignment: True --> Active
2021-08-24 16:55:16,861 INFO: >>> evaluate_begin: 5 --> Active
2021-08-24 16:55:16,862 INFO: >>> dataset_file: {'train': ['/data/hitz02/ABSADatasets/datasets/atepc_datasets/TShirt/Menstshirt_Train.xml.seg.atepc'], 'test': ['/data/hitz02/ABSADatasets/datasets/atepc_datasets/TShirt/Menstshirt_Test_Gold.xml.seg.atepc']} --> Active
2021-08-24 16:55:16,863 INFO: >>> device: cuda --> Active
2021-08-24 16:55:16,866 INFO: >>> device_name: NVIDIA GeForce GTX 1080 Ti --> Active
2021-08-24 16:55:16,867 INFO: >>> model_name: lcfs_atepc --> Active
2021-08-24 16:55:16,868 INFO: >>> Version: 1.1.1 --> Active
2021-08-24 16:55:16,869 INFO: >>> dataset_path: /data/hitz02/ABSADatasets/datasets/atepc_datasets/TShirt --> Active
2021-08-24 16:55:16,870 INFO: >>> save_mode: 1 --> Active
2021-08-24 16:55:16,872 INFO: >>> model_path_to_save: /data/hitz02/checkpoints --> Active
2021-08-24 16:55:16,873 INFO: >>> num_labels: 6 --> Active
2021-08-24 16:55:16,874 INFO: >>> max_test_metrics: {'max_apc_test_acc': 90.21, 'max_apc_test_f1': 74.78, 'max_ate_test_f1': 80.75} --> Active
2021-08-24 16:55:16,875 INFO: >>> metrics_of_this_checkpoint: {'apc_acc': 89.57, 'apc_f1': 72.5, 'ate_f1': 78.53} --> Active
2021-08-24 16:55:16,876 INFO: >>> use_syntax_based_SRD: False --> Default
2021-08-24 16:55:16,878 INFO: >>> window: lr --> Default
2021-08-24 16:55:16,879 INFO: >>> initializer: xavier_uniform_ --> Default
2021-08-24 16:55:17,384 INFO: ***** Running training for Aspect Term Extraction *****
2021-08-24 16:55:17,386 INFO: Num examples = 1870
2021-08-24 16:55:17,387 INFO: Batch size = 16
2021-08-24 16:55:17,389 INFO: Num steps = 696
100%|██████████| 117/117 [01:36<00:00, 1.21it/s, Epoch:0 | No evaluation until epoch:5]
100%|██████████| 117/117 [01:36<00:00, 1.21it/s, Epoch:1 | No evaluation until epoch:5]
100%|██████████| 117/117 [01:38<00:00, 1.19it/s, Epoch:2 | No evaluation until epoch:5]
100%|██████████| 117/117 [01:37<00:00, 1.20it/s, Epoch:3 | No evaluation until epoch:5]
100%|██████████| 117/117 [01:37<00:00, 1.20it/s, Epoch:4 | No evaluation until epoch:5]
100%|██████████| 117/117 [01:57<00:00, 1.01s/it, Epoch:5 | loss_apc:0.0626 | loss_ate:0.0921 | APC_ACC: 89.57(max:90.21) | APC_F1: 72.5(max:74.78) | ATE_F1: 78.53(max:80.75)]
2021-08-24 17:05:21,545 INFO: -------------------------------------Training Summary-------------------------------------
2021-08-24 17:05:21,547 INFO: Max APC Acc: 90.21000 Max APC F1: 74.78000 Max ATE F1: 80.75000 Accumulated Loss: 371.3037539720535
2021-08-24 17:05:21,548 INFO: -------------------------------------Training Summary-------------------------------------
Training finished, we hope you can share your checkpoint with everybody, please see: https://github.com/yangheng95/PyABSA#how-to-share-checkpoints-eg-checkpoints-trained-on-your-custom-dataset-with-community
2021-08-24 17:05:21,551 INFO: >>> model: <class 'pyabsa.core.atepc.models.lcfs_atepc.LCFS_ATEPC'> --> Active
2021-08-24 17:05:21,552 INFO: >>> optimizer: adamw --> Active
2021-08-24 17:05:21,553 INFO: >>> learning_rate: 2e-05 --> Active
2021-08-24 17:05:21,555 INFO: >>> pretrained_bert: bert-base-uncased --> Active
2021-08-24 17:05:21,557 INFO: >>> use_bert_spc: False --> Active
2021-08-24 17:05:21,558 INFO: >>> max_seq_len: 80 --> Active
2021-08-24 17:05:21,559 INFO: >>> SRD: 3 --> Active
2021-08-24 17:05:21,560 INFO: >>> lcf: cdw --> Active
2021-08-24 17:05:21,560 INFO: >>> dropout: 0.5 --> Active
2021-08-24 17:05:21,561 INFO: >>> l2reg: 5e-05 --> Active
2021-08-24 17:05:21,564 INFO: >>> num_epoch: 6 --> Active
2021-08-24 17:05:21,565 INFO: >>> batch_size: 16 --> Active
2021-08-24 17:05:21,566 INFO: >>> seed: 2 --> Active
2021-08-24 17:05:21,566 INFO: >>> embed_dim: 768 --> Active
2021-08-24 17:05:21,568 INFO: >>> hidden_dim: 768 --> Active
2021-08-24 17:05:21,569 INFO: >>> polarities_dim: 3 --> Active
2021-08-24 17:05:21,570 INFO: >>> log_step: 100 --> Active
2021-08-24 17:05:21,572 INFO: >>> gradient_accumulation_steps: 1 --> Active
2021-08-24 17:05:21,573 INFO: >>> dynamic_truncate: True --> Active
2021-08-24 17:05:21,573 INFO: >>> srd_alignment: True --> Active
2021-08-24 17:05:21,574 INFO: >>> evaluate_begin: 5 --> Active
2021-08-24 17:05:21,576 INFO: >>> dataset_file: {'train': ['/data/hitz02/ABSADatasets/datasets/atepc_datasets/TShirt/Menstshirt_Train.xml.seg.atepc'], 'test': ['/data/hitz02/ABSADatasets/datasets/atepc_datasets/TShirt/Menstshirt_Test_Gold.xml.seg.atepc']} --> Active
2021-08-24 17:05:21,577 INFO: >>> device: cuda --> Active
2021-08-24 17:05:21,578 INFO: >>> device_name: NVIDIA GeForce GTX 1080 Ti --> Active
2021-08-24 17:05:21,579 INFO: >>> model_name: lcfs_atepc --> Active
2021-08-24 17:05:21,580 INFO: >>> Version: 1.1.1 --> Active
2021-08-24 17:05:21,582 INFO: >>> dataset_path: /data/hitz02/ABSADatasets/datasets/atepc_datasets/TShirt --> Active
2021-08-24 17:05:21,583 INFO: >>> save_mode: 1 --> Active
2021-08-24 17:05:21,583 INFO: >>> model_path_to_save: /data/hitz02/checkpoints --> Active
2021-08-24 17:05:21,584 INFO: >>> num_labels: 6 --> Active
2021-08-24 17:05:21,586 INFO: >>> max_test_metrics: {'max_apc_test_acc': 90.21, 'max_apc_test_f1': 74.78, 'max_ate_test_f1': 80.75} --> Active
2021-08-24 17:05:21,587 INFO: >>> metrics_of_this_checkpoint: {'apc_acc': 89.57, 'apc_f1': 72.5, 'ate_f1': 78.53} --> Active
2021-08-24 17:05:21,588 INFO: >>> use_syntax_based_SRD: False --> Default
2021-08-24 17:05:21,590 INFO: >>> window: lr --> Default
2021-08-24 17:05:21,590 INFO: >>> initializer: xavier_uniform_ --> Default
This is the aspect extractor aims to extract aspect and predict sentiment, note that use_bert_spc is disabled while extracting aspects and classifying sentiment!
Load aspect extractor from /data/hitz02/checkpoints/lcfs_atepc_cdw_apcacc_89.57_apcf1_72.5_atef1_78.53/
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/data/hitz02/env/lib/python3.6/site-packages/pyabsa/core/atepc/prediction/aspect_extractor.py in __init__(self, model_arg, sentiment_map)
58 config_path = find_file(model_arg, '.config', exclude_key=['__MACOSX'])
---> 59
60 print('config: {}'.format(config_path))
TypeError: expected str, bytes or os.PathLike object, not NoneType
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
<ipython-input-59-d4b7ed1bdcf7> in <module>
3 from_checkpoint=checkpoint_path,
4 checkpoint_save_mode=1,
----> 5 auto_device='cuda'
6 )
/data/hitz02/env/lib/python3.6/site-packages/pyabsa/functional/trainer/trainer.py in __init__(self, config, dataset, from_checkpoint, checkpoint_save_mode, auto_device)
107 self.config.seed = [self.config.seed]
108
--> 109 model_path = []
110 seeds = self.config.seed
111 for _, s in enumerate(seeds):
/data/hitz02/env/lib/python3.6/site-packages/pyabsa/functional/trainer/trainer.py in train(self)
126
127
--> 128 class APCTrainer(Trainer):
129 pass
130
/data/hitz02/env/lib/python3.6/site-packages/pyabsa/core/atepc/prediction/aspect_extractor.py in __init__(self, model_arg, sentiment_map)
78 else:
79 self.tokenizer = BertTokenizer.from_pretrained(self.opt.pretrained_bert, do_lower_case=True)
---> 80
81 self.tokenizer.bos_token = self.tokenizer.bos_token if self.tokenizer.bos_token else '[CLS]'
82 self.tokenizer.eos_token = self.tokenizer.eos_token if self.tokenizer.eos_token else '[SEP]'
KeyError: 'Fail to load the model from /data/hitz02/checkpoints/lcfs_atepc_cdw_apcacc_89.57_apcf1_72.5_atef1_78.53/! the checkpoint is broken, or maybe the checkpoint is not compatible with this version.'
Downloading checkpoint:english from Google Drive...
Notice: The pretrained model are used for testing, neither trained using fine-tuned the hyper-parameters nor trained with enough steps, it is recommended to train the model on your own custom datasets
Checkpoint already downloaded, skip...
TShirt dataset is integrated dataset from: https://github.com/yangheng95/ABSADatasets
Seems datasets downloaded /content/integrated_datasets
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight']
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight']
I tested on my environment and on Colab and found no error. I notice you are actually using the 1.1.1 version of pyabsa. I uploaded a test version, 1.1.7a1; you can try it.
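For reference, the pre-release should be installable directly (assuming it was published on PyPI):
pip install pyabsa==1.1.7a1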
Got a ValueError now in v1.1.9. I don't know how, but the training logs still show version 1.1.7.
config = ATEPCConfigManager.get_atepc_config_english()
checkpoint_path = 'lcf_atepc_cdw_apcacc_82.77_apcf1_72.41_atef1_80.97'
config.model = ATEPCModelList.LCFS_ATEPC
config.evaluate_begin = 5
config.num_epoch = 6
config.log_step = 100
config.seed = 2
data_path = '/data/hitz02/ABSADatasets/datasets/atepc_datasets/TShirt'
aspect_extractor = ATEPCTrainer(config=config,
dataset=data_path,
from_checkpoint=checkpoint_path,
checkpoint_save_mode=1,
auto_device='cuda'
)
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Using bos_token, but it is not set yet.
Using eos_token, but it is not set yet.
100%|██████████| 1870/1870 [00:19<00:00, 95.56it/s, convert examples to features]
100%|██████████| 470/470 [00:04<00:00, 95.33it/s, convert examples to features]
2021-08-25 12:30:25,915 INFO: >>> model: <class 'pyabsa.core.atepc.models.lcfs_atepc.LCFS_ATEPC'> --> Active
2021-08-25 12:30:25,916 INFO: >>> optimizer: adamw --> Active
2021-08-25 12:30:25,917 INFO: >>> learning_rate: 2e-05 --> Active
2021-08-25 12:30:25,918 INFO: >>> pretrained_bert: bert-base-uncased --> Active
2021-08-25 12:30:25,919 INFO: >>> use_bert_spc: False --> Active
2021-08-25 12:30:25,920 INFO: >>> max_seq_len: 80 --> Active
2021-08-25 12:30:25,921 INFO: >>> SRD: 3 --> Active
2021-08-25 12:30:25,922 INFO: >>> dropout: 0.5 --> Active
2021-08-25 12:30:25,922 INFO: >>> l2reg: 5e-05 --> Active
2021-08-25 12:30:25,923 INFO: >>> num_epoch: 6 --> Active
2021-08-25 12:30:25,924 INFO: >>> batch_size: 16 --> Active
2021-08-25 12:30:25,925 INFO: >>> seed: 2 --> Active
2021-08-25 12:30:25,926 INFO: >>> embed_dim: 768 --> Active
2021-08-25 12:30:25,926 INFO: >>> hidden_dim: 768 --> Active
2021-08-25 12:30:25,927 INFO: >>> polarities_dim: 3 --> Active
2021-08-25 12:30:25,929 INFO: >>> log_step: 100 --> Active
2021-08-25 12:30:25,930 INFO: >>> gradient_accumulation_steps: 1 --> Active
2021-08-25 12:30:25,930 INFO: >>> dynamic_truncate: True --> Active
2021-08-25 12:30:25,931 INFO: >>> srd_alignment: True --> Active
2021-08-25 12:30:25,931 INFO: >>> evaluate_begin: 5 --> Active
2021-08-25 12:30:25,932 INFO: >>> dataset_item: ['/data/hitz02/ABSADatasets/datasets/atepc_datasets/TShirt'] --> Active
2021-08-25 12:30:25,932 INFO: >>> dataset_name: custom_dataset --> Active
2021-08-25 12:30:25,933 INFO: >>> dataset_file: {'train': ['/data/hitz02/ABSADatasets/datasets/atepc_datasets/TShirt/Menstshirt_Train.xml.seg.atepc'], 'test': ['/data/hitz02/ABSADatasets/datasets/atepc_datasets/TShirt/Menstshirt_Test_Gold.xml.seg.atepc']} --> Active
2021-08-25 12:30:25,933 INFO: >>> device: cuda --> Active
2021-08-25 12:30:25,934 INFO: >>> device_name: NVIDIA GeForce GTX 1080 Ti --> Active
2021-08-25 12:30:25,935 INFO: >>> model_name: lcfs_atepc --> Active
2021-08-25 12:30:25,935 INFO: >>> Version: 1.1.7 --> Active
2021-08-25 12:30:25,936 INFO: >>> save_mode: 1 --> Active
2021-08-25 12:30:25,936 INFO: >>> model_path_to_save: /data/hitz02/checkpoints --> Active
2021-08-25 12:30:25,937 INFO: >>> num_labels: 6 --> Active
2021-08-25 12:30:25,937 INFO: >>> use_syntax_based_SRD: False --> Default
2021-08-25 12:30:25,938 INFO: >>> lcf: cdw --> Default
2021-08-25 12:30:25,938 INFO: >>> window: lr --> Default
2021-08-25 12:30:25,939 INFO: >>> initializer: xavier_uniform_ --> Default
Checkpoint loaded!
2021-08-25 12:30:26,486 INFO: ***** Running training for Aspect Term Extraction *****
2021-08-25 12:30:26,487 INFO: Num examples = 1870
2021-08-25 12:30:26,488 INFO: Batch size = 16
2021-08-25 12:30:26,488 INFO: Num steps = 696
100%|██████████| 117/117 [01:36<00:00, 1.21it/s, Epoch:0 | No evaluation until epoch:5]
100%|██████████| 117/117 [01:36<00:00, 1.21it/s, Epoch:1 | No evaluation until epoch:5]
100%|██████████| 117/117 [01:35<00:00, 1.22it/s, Epoch:2 | No evaluation until epoch:5]
100%|██████████| 117/117 [01:36<00:00, 1.21it/s, Epoch:3 | No evaluation until epoch:5]
100%|██████████| 117/117 [01:35<00:00, 1.22it/s, Epoch:4 | No evaluation until epoch:5]
12%|█▏ | 14/117 [00:26<03:14, 1.88s/it]
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-13-d4b7ed1bdcf7> in <module>
3 from_checkpoint=checkpoint_path,
4 checkpoint_save_mode=1,
----> 5 auto_device='cuda'
6 )
/data/hitz02/env/lib/python3.6/site-packages/pyabsa/functional/trainer/trainer.py in __init__(self, config, dataset, from_checkpoint, checkpoint_save_mode, auto_device)
101 config.model_path_to_save = None
102
--> 103 self.train()
104
105 def train(self):
/data/hitz02/env/lib/python3.6/site-packages/pyabsa/functional/trainer/trainer.py in train(self)
113 config.seed = s
114 if self.checkpoint_save_mode:
--> 115 model_path.append(self.train_func(config, self.from_checkpoint, self.logger))
116 else:
117 # always return the last trained model if dont save trained model
/data/hitz02/env/lib/python3.6/site-packages/pyabsa/core/atepc/training/atepc_trainer.py in train4atepc(opt, from_checkpoint_path, logger)
353 load_checkpoint(trainer, from_checkpoint_path)
354
--> 355 return trainer.run()
/data/hitz02/env/lib/python3.6/site-packages/pyabsa/core/atepc/training/atepc_trainer.py in run(self)
207 )
208
--> 209 save_model(self.opt, self.model, self.tokenizer, save_path)
210
211 current_apc_test_acc = apc_result['apc_test_acc']
/data/hitz02/env/lib/python3.6/site-packages/pyabsa/utils/file_utils.py in save_model(opt, model, tokenizer, save_path)
237 # torch.save(self.model.cpu().state_dict(), save_path + self.opt.model_name + '.state_dict') # save the state dict
238 torch.save(model.state_dict(), save_path + opt.model_name + '.state_dict') # save the state dict
--> 239 pickle.dump(opt, open(save_path + opt.model_name + '.config', mode='wb'))
240 pickle.dump(tokenizer, open(save_path + opt.model_name + '.tokenizer', mode='wb'))
241 save_args(opt, save_path + opt.model_name + '.args.txt')
ValueError: binary mode doesn't take an encoding argument
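(For context, this ValueError means an encoding argument reached a binary-mode open() somewhere in the installed copy; the snippet in the traceback above already omits it. A corrected sketch of the save call, using save_path/opt as in file_utils.save_model above:)
import pickle
# binary-mode files must be opened without an encoding argument
with open(save_path + opt.model_name + '.config', mode='wb') as f:
    pickle.dump(opt, f)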
Out-of-range error while training ATEPC model - english on T-shirt dataset.
...
config.model = ATEPCModelList.LCFS_ATEPC
config.evaluate_begin = 5
config.num_epoch = 6
config.log_step = 100
tshirt = ABSADatasetList.TShirt
aspect_extractor = Trainer(config=config,
                           dataset=tshirt,
                           checkpoint_save_mode=1,
                           auto_device=True
                           )
Traceback ->
TShirt dataset is not found locally, search at https://github.com/yangheng95/ABSADatasets
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight']
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Using bos_token, but it is not set yet.
Using eos_token, but it is not set yet.
59%|█████▊ | 1098/1870 [00:10<00:07, 100.65it/s, convert examples to features]
IndexError Traceback (most recent call last)
7 frames
/usr/local/lib/python3.7/dist-packages/pyabsa/functional/trainer/trainer.py in __init__(self, config, dataset, from_checkpoint, checkpoint_save_mode, auto_device)
     92             config.model_path_to_save = None
     93
---> 94         self.train()
     95
     96     def train(self):

/usr/local/lib/python3.7/dist-packages/pyabsa/functional/trainer/trainer.py in train(self)
    103             self.config.seed = s
    104             if self.checkpoint_save_mode:
--> 105                 model_path.append(self.train_func(self.config, self.from_checkpoint, self.logger))
    106             else:
    107                 # always return the last trained model if dont save trained model

/usr/local/lib/python3.7/dist-packages/pyabsa/core/atepc/training/atepc_trainer.py in train4atepc(opt, from_checkpoint_path, logger)
    352     while not trainer:
    353         try:
--> 354             trainer = Instructor(opt, logger)
    355             if from_checkpoint_path:
    356                 model_path = find_files(from_checkpoint_path, '.model')

/usr/local/lib/python3.7/dist-packages/pyabsa/core/atepc/training/atepc_trainer.py in __init__(self, opt, logger)
     70             len(self.train_examples) / self.opt.batch_size / self.opt.gradient_accumulation_steps) * self.opt.num_epoch
     71         train_features = convert_examples_to_features(self.train_examples, self.label_list, self.opt.max_seq_len,
---> 72                                                       self.tokenizer, self.opt)
     73         all_spc_input_ids = torch.tensor([f.input_ids_spc for f in train_features], dtype=torch.long)
     74         all_input_mask = torch.tensor([f.input_mask for f in train_features], dtype=torch.long)

/usr/local/lib/python3.7/dist-packages/pyabsa/core/atepc/dataset_utils/data_utils_for_training.py in convert_examples_to_features(examples, label_list, max_seq_len, tokenizer, opt)
    188             text_right = ''
    189             aspect = ''
--> 190             prepared_inputs = prepare_input_for_atepc(opt, tokenizer, text_left, text_right, aspect)
    191             lcf_cdm_vec = prepared_inputs['lcf_cdm_vec']
    192             lcf_cdw_vec = prepared_inputs['lcf_cdw_vec']

/usr/local/lib/python3.7/dist-packages/pyabsa/core/atepc/dataset_utils/atepc_utils.py in prepare_input_for_atepc(opt, tokenizer, text_left, text_right, aspect)
     60
     61     if 'lcfs' in opt.model_name or opt.use_syntax_based_SRD:
---> 62         syntactical_dist, _ = get_syntax_distance(text_raw, aspect, tokenizer, opt)
     63     else:
     64         syntactical_dist = None

/usr/local/lib/python3.7/dist-packages/pyabsa/core/apc/dataset_utils/apc_utils.py in get_syntax_distance(text_raw, aspect, tokenizer, opt)
    240     # the following two functions are both designed to calculate syntax-based distances
    241     if opt.srd_alignment:
--> 242         syntactical_dist = syntax_distance_alignment(raw_tokens, dist, opt.max_seq_len, tokenizer)
    243     else:
    244         syntactical_dist = pad_syntax_based_srd(raw_tokens, dist, tokenizer, opt)[1]

/usr/local/lib/python3.7/dist-packages/pyabsa/core/apc/dataset_utils/apc_utils.py in syntax_distance_alignment(tokens, dist, max_seq_len, tokenizer)
     38     if bert_tokens != text:
     39         while text or bert_tokens:
---> 40             if text[0] == ' ' or text[0] == '\xa0':  # bad case handle
     41                 text = text[1:]
     42                 dep_dist = dep_dist[1:]
IndexError: list index out of range