xlang-ai / UnifiedSKG

[EMNLP 2022] Unifying and multi-tasking structured knowledge grounding with language models
https://arxiv.org/abs/2201.05966
Apache License 2.0

[Deprecated] Separate setting: OverflowError: cannot fit 'int' into an index-sized integer #23

Closed puraminy closed 2 years ago

puraminy commented 2 years ago

In the fetaqa config file, if I change concatenate to separate and run prefix tuning, the following error occurs:

  File "train.py", line 185, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/pouramini/anaconda3/envs/uni/lib/python3.7/site-packages/transformers/trainer.py", line 1260, in train
    for step, inputs in enumerate(epoch_iterator):
  File "/home/pouramini/anaconda3/envs/uni/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
    data = self._next_data()
  File "/home/pouramini/anaconda3/envs/uni/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 475, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/pouramini/anaconda3/envs/uni/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/pouramini/anaconda3/envs/uni/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/pouramini/UnifiedSKG/utils/dataset.py", line 116, in __getitem__
    max_length=self.tokenizer.model_max_length,
  File "/home/pouramini/anaconda3/envs/uni/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 2406, in __call__
    **kwargs,
  File "/home/pouramini/anaconda3/envs/uni/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 2476, in encode_plus
    **kwargs,
  File "/home/pouramini/anaconda3/envs/uni/lib/python3.7/site-packages/transformers/tokenization_utils.py", line 480, in _encode_plus
    verbose=verbose,
  File "/home/pouramini/anaconda3/envs/uni/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 2913, in prepare_for_model
    return_attention_mask=return_attention_mask,
  File "/home/pouramini/anaconda3/envs/uni/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 2731, in pad
    return_attention_mask=return_attention_mask,
  File "/home/pouramini/anaconda3/envs/uni/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 3065, in _pad
    encoded_inputs["attention_mask"] = [1] * len(required_input) + [0] * difference
OverflowError: cannot fit 'int' into an index-sized integer
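For context, the overflow happens because Python refuses to build a list longer than the platform's index type (`Py_ssize_t`). A minimal reproduction, assuming the sentinel-sized `max_length` (`int(1e30)` here mirrors the default `transformers` uses when no model max length is configured, and is an assumption, not UnifiedSKG code):

```python
# Minimal reproduction of the padding overflow (illustrative, not UnifiedSKG code).
# If model_max_length is an unset sentinel (e.g. int(1e30)), the padding
# "difference" becomes astronomically large and list repetition overflows.
seq_len = 5
max_length = int(1e30)            # sentinel-sized max length (assumption)
difference = max_length - seq_len

try:
    attention_mask = [1] * seq_len + [0] * difference
except OverflowError as e:
    print(e)  # cannot fit 'int' into an index-sized integer
```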
Timothyxxx commented 2 years ago

Hi,

Could you check your transformers and datasets versions and make sure they match the ones we provided?

Thanks

puraminy commented 2 years ago

I run it in a conda virtual env, and yes, they are the same:

transformers==4.9.2

datasets==1.14.0

Timothyxxx commented 2 years ago

OK, then it seems like a problem caused by RAM?

ChenWu98 commented 2 years ago

Hi,

The separate feature is deprecated and not used in our experiments, so we did not test it. In our latest commit e337e3b0b492772ba10f35bf64112c10d16d203c, we attempted to fix this bug. In short, the model_max_length flag of T5 is set to a very large integer, which causes the RAM overflow. Could you check the latest code? Thanks!
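A minimal sketch of the kind of fix described above (the helper name and fallback value are illustrative, not the repo's actual code): when the tokenizer reports a sentinel-sized `model_max_length`, cap it to a usable value before it becomes the padding target.

```python
# Sentinel that transformers uses when no model max length is configured.
VERY_LARGE_INTEGER = int(1e30)

def safe_max_length(model_max_length: int, fallback: int = 1024) -> int:
    """Return a usable max length, replacing a sentinel-sized value.

    Illustrative helper: if the tokenizer's model_max_length was never set,
    it is a huge sentinel, and padding to it overflows; cap it instead.
    """
    return fallback if model_max_length >= VERY_LARGE_INTEGER else model_max_length

print(safe_max_length(int(1e30)))  # 1024
print(safe_max_length(512))        # 512
```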

ChenWu98 commented 2 years ago

However, you may face GPU memory overflow, since the input length is effectively doubled when using separate.

puraminy commented 2 years ago

I manually changed utils/dataset.py according to your commit. However, self.tokenizer.input_max_length isn't recognized. I set it to 100, and the problem was resolved.

By the way, what is the difference between the separate and concatenate settings?

Timothyxxx commented 2 years ago

Hi, thanks for pointing that out; we will fix it soon. As for the meanings of the separate and concatenate settings:

separate: get the prefix weights from the query and the structured knowledge separately. It has been deprecated since it didn't improve the results in our early experiments.

concatenate: concatenate the query, structured knowledge, and context into one input sequence and get its prefix weights. It is simple and effective, so we adopt it as the default.
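To illustrate the default concatenate setting, here is a rough sketch of how the three fields might be joined into one input sequence (the function name, field order, and separator are assumptions, not the repo's exact serialization):

```python
def build_concat_input(query: str, struct_knowledge: str,
                       context: str = "", sep: str = " ; ") -> str:
    # Join the query, linearized structured knowledge, and optional context
    # into a single sequence, as in the "concatenate" setting; empty fields
    # are dropped so the separator never dangles.
    parts = [p for p in (query, struct_knowledge, context) if p]
    return sep.join(parts)

print(build_concat_input("who won?", "table: team | score"))
# who won? ; table: team | score
```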

Contact us if you have some further findings!