While the program is stuck, I see no GPU utilization and no output files.
Hi, thank you for your attention.
Based on the output information above, it seems like the program is getting stuck during `load_dataset`:
https://github.com/yyDing1/GNER/blob/c2db8b54656811c432858bd3e75a0f0a711a6161/src/run.py#L208-L214
Please check that the settings in your dataset config (`configs/dataset_configs/task_adaptation_configs` in the case above) match the dataset in the `data` folder. Additionally, it's best to ensure that your `datasets` version is greater than 2.17.0.
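For a quick sanity check of the installed version (a trivial snippet, not part of the repo):

```python
# Print the installed `datasets` version; it should meet the minimum
# suggested above (2.17.0).
import datasets

print(datasets.__version__)
```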
Hi, I changed the `datasets` version to 2.17.0 and 2.18.0, and Python to 3.7, 3.8, and 3.9, but it still didn't work.
I'm also confused about the `data_config_dir` argument passed to `load_dataset`. I don't see it used anywhere in this repo or in https://github.com/huggingface/datasets, so I'm not sure the code is correct.
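(If I understand correctly, extra kwargs like this are forwarded to the custom script's builder config rather than handled by the `datasets` library itself, which would explain why it never shows up in that repo. A minimal sketch, with hypothetical class names, of how `src/gner_dataset.py` might receive it:)

```python
# Sketch (hypothetical names): kwargs passed to load_dataset() alongside a
# script path are forwarded to the script's BuilderConfig, so fields like
# data_config_dir exist only inside src/gner_dataset.py.
import datasets


class GNERConfig(datasets.BuilderConfig):
    """Builder config for the custom script; extra kwargs land here."""

    def __init__(self, data_config_dir=None, **kwargs):
        super().__init__(**kwargs)
        # Consumed by the script's generator code, never by `datasets` itself.
        self.data_config_dir = data_config_dir
```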
These are the versions of my env, do you have any ideas? Thanks!
torch==1.13.1
transformers==4.38.2
datasets==2.18.0
deepspeed==0.14.0
multiprocess==0.70.16
accelerate==0.28.0
protobuf==5.26.0
This is the warning shown when `load_dataset` runs:
/data2/derongxu/anaconda3/envs/py38/lib/python3.8/site-packages/datasets/load.py:926: FutureWarning: The repository for gner_dataset contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at src/gner_dataset.py
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
warnings.warn(
I killed the process with Ctrl+C and copy-pasted the `KeyboardInterrupt` traceback (each of the four subprocesses printed the same trace, so only one copy is shown):
^C[2024-03-19 06:21:06,529] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 2351221
Traceback (most recent call last):
  File "src/run.py", line 490, in <module>
    main()
  File "src/run.py", line 213, in main
    raw_datasets = load_dataset(
  File "/data2/derongxu/anaconda3/envs/py38/lib/python3.8/site-packages/datasets/load.py", line 2556, in load_dataset
    builder_instance = load_dataset_builder(
  File "/data2/derongxu/anaconda3/envs/py38/lib/python3.8/site-packages/datasets/load.py", line 2228, in load_dataset_builder
    dataset_module = dataset_module_factory(
  File "/data2/derongxu/anaconda3/envs/py38/lib/python3.8/site-packages/datasets/load.py", line 1772, in dataset_module_factory
    return LocalDatasetModuleFactoryWithScript(
  File "/data2/derongxu/anaconda3/envs/py38/lib/python3.8/site-packages/datasets/load.py", line 960, in get_module
    _create_importable_file(
  File "/data2/derongxu/anaconda3/envs/py38/lib/python3.8/site-packages/datasets/load.py", line 466, in _create_importable_file
    importable_local_file = _copy_script_and_other_resources_in_importable_dir(
  File "/data2/derongxu/anaconda3/envs/py38/lib/python3.8/site-packages/datasets/load.py", line 388, in _copy_script_and_other_resources_in_importable_dir
    with FileLock(lock_path):
  File "/data2/derongxu/anaconda3/envs/py38/lib/python3.8/site-packages/filelock/_api.py", line 297, in __enter__
    self.acquire()
  File "/data2/derongxu/anaconda3/envs/py38/lib/python3.8/site-packages/filelock/_api.py", line 267, in acquire
    time.sleep(poll_interval)
KeyboardInterrupt
[2024-03-19 06:21:06,600] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 2351222
[2024-03-19 06:21:06,645] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 2351223
[2024-03-19 06:21:06,689] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 2351224
[2024-03-19 06:21:06,733] [INFO] [launch.py:325:sigkill_handler] Main process received SIGINT, exiting
Traceback (most recent call last):
  File "/data2/derongxu/anaconda3/envs/py38/bin/deepspeed", line 6, in <module>
    main()
  File "/data2/derongxu/anaconda3/envs/py38/lib/python3.8/site-packages/deepspeed/launcher/runner.py", line 584, in main
    result.wait()
  File "/data2/derongxu/anaconda3/envs/py38/lib/python3.8/subprocess.py", line 1083, in wait
    return self._wait(timeout=timeout)
  File "/data2/derongxu/anaconda3/envs/py38/lib/python3.8/subprocess.py", line 1822, in _wait
    (pid, sts) = self._try_wait(0)
  File "/data2/derongxu/anaconda3/envs/py38/lib/python3.8/subprocess.py", line 1780, in _try_wait
    (pid, sts) = os.waitpid(self.pid, wait_flags)
KeyboardInterrupt
I am still trying to replicate the error.
Based on my experience, the following may help: pass `trust_remote_code=True` to the `load_dataset` function, as we are using customized datasets. The specific code can be found in `src/gner_dataset.py`.
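A minimal sketch of that call, assuming the script path and config directory mentioned earlier in this thread (the exact keyword arguments in `run.py` may differ):

```python
# Sketch: load the customized GNER dataset script with remote code enabled.
# `data_dir` and `data_config_dir` follow the names used in this thread and
# are assumptions, not the verified run.py signature.
from datasets import load_dataset

raw_datasets = load_dataset(
    "src/gner_dataset.py",
    data_dir="data",
    data_config_dir="configs/dataset_configs/task_adaptation_configs",
    trust_remote_code=True,  # allow the custom builder code to run
)
```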
I also encountered a hang-up in the `load_dataset` function, but after waiting for a while (about five to ten minutes), it continued to load and run. I believe this is due to network issues.
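If the stall really is network-related, one thing that might be worth trying (an assumption on my part, not a verified fix) is forcing `datasets` into offline mode so it skips Hub lookups and relies only on local files and cache:

```python
# Assumption: set offline mode before `datasets` is imported so that no
# Hugging Face Hub requests are attempted during load_dataset().
import os

os.environ["HF_DATASETS_OFFLINE"] = "1"

import datasets  # imported after the env var so offline mode takes effect
```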
The output of the run is as follows:
/data/zzk/anaconda3/envs/NER/lib/python3.10/site-packages/datasets/load.py:922: FutureWarning: The repository for gner_dataset contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at /data/zzk/IE/GNER/src/gner_dataset.py
You can avoid this message in future by passing the argument trust_remote_code=True
.
Passing trust_remote_code=True
will be mandatory to load this dataset from the next major release of datasets
.
warnings.warn(
Using custom data configuration default-e60fd0a2766fdda3
03/19/2024 16:02:19 - INFO - datasets.builder - Using custom data configuration default-e60fd0a2766fdda3
Loading Dataset Infos from /data/zzk/.cache/huggingface/modules/datasets_modules/datasets/gner_dataset/0b99e6e3a59ef6dcb9af89de1cd95359e538dad5813fd3c6b8ef999024427d89
03/19/2024 16:02:19 - INFO - datasets.info - Loading Dataset Infos from /data/zzk/.cache/huggingface/modules/datasets_modules/datasets/gner_dataset/0b99e6e3a59ef6dcb9af89de1cd95359e538dad5813fd3c6b8ef999024427d89
Generating validation split: 1400 examples [00:00, 4708.74 examples/s]
Generating test split: 6470 examples [00:00, 7400.58 examples/s]
Found cached dataset gner_dataset (/data/zzk/.cache/huggingface/datasets/gner_dataset/default-e60fd0a2766fdda3/0.0.0/0b99e6e3a59ef6dcb9af89de1cd95359e538dad5813fd3c6b8ef999024427d89)
03/19/2024 16:10:42 - INFO - datasets.builder - Found cached dataset gner_dataset (/data/zzk/.cache/huggingface/datasets/gner_dataset/default-e60fd0a2766fdda3/0.0.0/0b99e6e3a59ef6dcb9af89de1cd95359e538dad5813fd3c6b8ef999024427d89)
Loading Dataset info from /data/zzk/.cache/huggingface/datasets/gner_dataset/default-e60fd0a2766fdda3/0.0.0/0b99e6e3a59ef6dcb9af89de1cd95359e538dad5813fd3c6b8ef999024427d89
03/19/2024 16:10:42 - INFO - datasets.info - Loading Dataset info from /data/zzk/.cache/huggingface/datasets/gner_dataset/default-e60fd0a2766fdda3/0.0.0/0b99e6e3a59ef6dcb9af89de1cd95359e538dad5813fd3c6b8ef999024427d89
Listing files in /data/zzk/.cache/huggingface/datasets/gner_dataset/default-e60fd0a2766fdda3/0.0.0/0b99e6e3a59ef6dcb9af89de1cd95359e538dad5813fd3c6b8ef999024427d89
03/19/2024 16:10:42 - INFO - datasets.arrow_dataset - Listing files in /data/zzk/.cache/huggingface/datasets/gner_dataset/default-e60fd0a2766fdda3/0.0.0/0b99e6e3a59ef6dcb9af89de1cd95359e538dad5813fd3c6b8ef999024427d89
Listing files in /data/zzk/.cache/huggingface/datasets/gner_dataset/default-e60fd0a2766fdda3/0.0.0/0b99e6e3a59ef6dcb9af89de1cd95359e538dad5813fd3c6b8ef999024427d89
03/19/2024 16:10:42 - INFO - datasets.arrow_dataset - Listing files in /data/zzk/.cache/huggingface/datasets/gner_dataset/default-e60fd0a2766fdda3/0.0.0/0b99e6e3a59ef6dcb9af89de1cd95359e538dad5813fd3c6b8ef999024427d89
[INFO|configuration_utils.py:727] 2024-03-19 16:10:42,960 >> loading configuration file output/flan-t5-base-task-adaptation/config.json
[INFO|configuration_utils.py:792] 2024-03-19 16:10:42,963 >> Model config T5Config {
"_name_or_path": "output/flan-t5-base-task-adaptation",
"architectures": [
"T5ForConditionalGeneration"
],
"classifier_dropout": 0.0,
"d_ff": 2048,
"d_kv": 64,
"d_model": 768,
"decoder_start_token_id": 0,
"dense_act_fn": "gelu_new",
"dropout_rate": 0.1,
"eos_token_id": 1,
"feed_forward_proj": "gated-gelu",
"initializer_factor": 1.0,
"is_encoder_decoder": true,
"is_gated_act": true,
"layer_norm_epsilon": 1e-06,
"model_type": "t5",
"n_positions": 512,
"num_decoder_layers": 12,
"num_heads": 12,
"num_layers": 12,
"output_past": true,
"pad_token_id": 0,
"relative_attention_max_distance": 128,
"relative_attention_num_buckets": 32,
"task_specific_params": {
"summarization": {
"early_stopping": true,
"length_penalty": 2.0,
"max_length": 200,
"min_length": 30,
"no_repeat_ngram_size": 3,
"num_beams": 4,
"prefix": "summarize: "
},
"translation_en_to_de": {
"early_stopping": true,
"max_length": 300,
"num_beams": 4,
"prefix": "translate English to German: "
},
"translation_en_to_fr": {
"early_stopping": true,
"max_length": 300,
"num_beams": 4,
"prefix": "translate English to French: "
},
"translation_en_to_ro": {
"early_stopping": true,
"max_length": 300,
"num_beams": 4,
"prefix": "translate English to Romanian: "
}
},
"tie_word_embeddings": false,
"torch_dtype": "float32",
"transformers_version": "4.37.1",
"use_cache": true,
"vocab_size": 32128
}
[INFO|tokenization_utils_base.py:2025] 2024-03-19 16:10:42,964 >> loading file spiece.model
[INFO|tokenization_utils_base.py:2025] 2024-03-19 16:10:42,964 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2025] 2024-03-19 16:10:42,964 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2025] 2024-03-19 16:10:42,964 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2025] 2024-03-19 16:10:42,964 >> loading file tokenizer_config.json
[INFO|modeling_utils.py:3475] 2024-03-19 16:10:43,009 >> loading weights file output/flan-t5-base-task-adaptation/model.safetensors
[INFO|configuration_utils.py:826] 2024-03-19 16:10:43,017 >> Generate config GenerationConfig {
  "decoder_start_token_id": 0,
  "eos_token_id": 1,
  "pad_token_id": 0
}
[INFO|modeling_utils.py:4352] 2024-03-19 16:10:43,580 >> All model checkpoint weights were used when initializing T5ForConditionalGeneration.
[INFO|modeling_utils.py:4360] 2024-03-19 16:10:43,580 >> All the weights of T5ForConditionalGeneration were initialized from the model checkpoint at output/flan-t5-base-task-adaptation.
If your task is similar to the task the model of the checkpoint was trained on, you can already use T5ForConditionalGeneration for predictions without further training.
[INFO|configuration_utils.py:779] 2024-03-19 16:10:43,583 >> loading configuration file output/flan-t5-base-task-adaptation/generation_config.json
[INFO|configuration_utils.py:826] 2024-03-19 16:10:43,583 >> Generate config GenerationConfig {
  "decoder_start_token_id": 0,
  "eos_token_id": 1,
  "pad_token_id": 0
}
Running tokenizer on prediction dataset:   0%| | 0/6470 [00:00<?, ? examples/s]
Caching processed dataset at /data/zzk/.cache/huggingface/datasets/gner_dataset/default-e60fd0a2766fdda3/0.0.0/0b99e6e3a59ef6dcb9af89de1cd95359e538dad5813fd3c6b8ef999024427d89/cache-ae0964abc6dfacfd.arrow
03/19/2024 16:10:43 - INFO - datasets.arrow_dataset - Caching processed dataset at /data/zzk/.cache/huggingface/datasets/gner_dataset/default-e60fd0a2766fdda3/0.0.0/0b99e6e3a59ef6dcb9af89de1cd95359e538dad5813fd3c6b8ef999024427d89/cache-ae0964abc6dfacfd.arrow
Running tokenizer on prediction dataset: 100%|█████████████| 6470/6470 [00:02<00:00, 2553.21 examples/s]
Running tokenizer on prediction dataset:  10%|███████▎ | 643/6470 [00:00<00:01, 3222.53 examples/s]
[INFO|trainer.py:571] 2024-03-19 16:10:47,087 >> Using auto half precision backend
03/19/2024 16:10:47 - INFO - main - Predict
[INFO|gner_trainer.py:60] 2024-03-19 16:10:47,092 >> Running Prediction
[INFO|gner_trainer.py:62] 2024-03-19 16:10:47,093 >> Num examples = 6470
[INFO|gner_trainer.py:65] 2024-03-19 16:10:47,093 >> Batch size = 4
Running tokenizer on prediction dataset: 100%|█████████████| 6470/6470 [00:02<00:00, 2460.44 examples/s]
 10%|████████████▏ | 78/809 [01:14<12:23, 1.02s/it]
Due to the low speed and instability of the customized dataset, I have made its use optional in commit https://github.com/yyDing1/GNER/commit/aed2fdf9298aae23b6185f769282d58fb090f93a. You simply need to add `--no_load_gner_customized_datasets` to the script and, as an alternative, specify a separate JSON file via `test_json_dir`.
In this file, you should format your input into the form of instructions (originally done in the customized dataset). An example of the processed file has been added, i.e. `data/zero-shot-test.jsonl`.
We also provide a corresponding script, `scripts/eval_t5_task_adaptation_json_input.sh`, which you can run.
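For reference, loading such a pre-instructed JSONL file reduces to the built-in `json` builder, so no custom script (and no script file lock) is involved; this mirrors the call quoted later in the thread:

```python
# Load the pre-formatted instruction file directly with the json builder.
from datasets import load_dataset

test_dataset = load_dataset(
    "json",
    data_files="data/zero-shot-test.jsonl",  # the example file mentioned above
    split="train",
)
```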
Looking forward to your feedback. @quqxui @XQZZK
Thanks very much for your efforts to help me.
I found that the program cannot run on the A6000 machine (CUDA 12.2); `scripts/eval_t5_task_adaptation_json_input.sh` doesn't work either. But it can run on the V100 machine (CUDA 11.7), even though they have the same conda env.
However, the V100 doesn't support bf16 and tf32, so performance decreases by 0.5.
I'm not sure if some feature of the A6000 is causing this problem. If you have any ideas, please let me know.
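(For anyone comparing the two machines: bf16 and tf32 require an Ampere-class GPU, i.e. compute capability >= 8.0; the A6000 qualifies and the V100 does not. A quick check:)

```python
# Report the GPU's identity and bf16 support; requires a CUDA build of torch.
import torch

print(torch.cuda.get_device_name(0))
print(torch.cuda.get_device_capability(0))  # (7, 0) on V100, (8, 6) on A6000
print(torch.cuda.is_bf16_supported())       # False on V100, True on A6000
```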
Thanks again for your help.
Finally!! I figured out the problem: file locking in `datasets` doesn't work on NFS file systems, as described in https://github.com/huggingface/datasets/issues/6744. The file system on my A6000 server is NFS.
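To confirm the filesystem type of a given path, something like the following works on Linux (assumes GNU coreutils `stat`):

```python
# Print the filesystem type of a path; "nfs" indicates the case where the
# default hard FileLock misbehaves.
import subprocess


def fs_type(path: str) -> str:
    out = subprocess.run(
        ["stat", "-f", "-c", "%T", path],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()


print(fs_type("."))
```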
A solution is to add the following code to `run.py`, as described in https://github.com/huggingface/datasets/issues/6395:
# Patch filelock BEFORE importing datasets, so that datasets binds the
# NFS-safe SoftFileLock instead of the default (hard) FileLock.
import filelock
filelock.FileLock = filelock.SoftFileLock
import datasets  # must come after the patch to pick it up
Note that this solution only supports this loading method:
load_dataset("json", data_files=data_args.test_json_dir, split="train")
So it works when using `scripts/eval_t5_task_adaptation_json_input.sh`, but it does not work with the loading method used by `scripts/eval_t5_task_adaptation.sh`.
Thanks for your help 👍 👍 👍
Hi,
I get a problem when testing the t5-base model. After running
bash scripts/eval_t5_task_adaptation.sh
the program is stuck in the output below. Do you have any idea about this situation? Thanks very much!