While the program is stuck, I see no GPU utilization and no output files.
Hi, thank you for your attention.
Based on the output information above, it seems like the program is getting stuck during `load_dataset`:
https://github.com/yyDing1/GNER/blob/c2db8b54656811c432858bd3e75a0f0a711a6161/src/run.py#L208-L214
Please check that the settings in your dataset config (`configs/dataset_configs/task_adaptation_configs` in the case above) match the dataset in the `data` folder. Additionally, it's best to ensure that your `datasets` version is greater than 2.17.0.
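For a quick sanity check of the installed version (a trivial snippet, not part of the repo):

```python
# Print the installed `datasets` version; it should meet the minimum
# suggested above (2.17.0).
import datasets

print(datasets.__version__)
```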
Hi, I changed the `datasets` version to 2.17.0 and 2.18.0, and Python to 3.7, 3.8, and 3.9, but it still didn't work.
I'm also confused about the `data_config_dir` argument passed to `load_dataset`. I don't see it used anywhere in this repo or in https://github.com/huggingface/datasets, so I'm not sure the code is correct.
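(If I understand correctly, extra kwargs like this are forwarded to the custom script's builder config rather than handled by the `datasets` library itself, which would explain why it never shows up in that repo. A minimal sketch, with hypothetical class names, of how `src/gner_dataset.py` might receive it:)

```python
# Sketch (hypothetical names): kwargs passed to load_dataset() alongside a
# script path are forwarded to the script's BuilderConfig, so fields like
# data_config_dir exist only inside src/gner_dataset.py.
import datasets


class GNERConfig(datasets.BuilderConfig):
    """Builder config for the custom script; extra kwargs land here."""

    def __init__(self, data_config_dir=None, **kwargs):
        super().__init__(**kwargs)
        # Consumed by the script's generator code, never by `datasets` itself.
        self.data_config_dir = data_config_dir
```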
These are the versions of my env, do you have any ideas? Thanks!
torch==1.13.1
transformers==4.38.2
datasets==2.18.0
deepspeed==0.14.0
multiprocess==0.70.16
accelerate==0.28.0
protobuf==5.26.0
This is the warning shown when `load_dataset` runs:
/data2/derongxu/anaconda3/envs/py38/lib/python3.8/site-packages/datasets/load.py:926: FutureWarning: The repository for gner_dataset contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at src/gner_dataset.py
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
warnings.warn(
I killed the process with Ctrl+C and copy-pasted the `KeyboardInterrupt` traceback (each of the four subprocesses printed the same trace, so only one copy is shown):
^C[2024-03-19 06:21:06,529] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 2351221
Traceback (most recent call last):
  File "src/run.py", line 490, in <module>
    main()
  File "src/run.py", line 213, in main
    raw_datasets = load_dataset(
  File "/data2/derongxu/anaconda3/envs/py38/lib/python3.8/site-packages/datasets/load.py", line 2556, in load_dataset
    builder_instance = load_dataset_builder(
  File "/data2/derongxu/anaconda3/envs/py38/lib/python3.8/site-packages/datasets/load.py", line 2228, in load_dataset_builder
    dataset_module = dataset_module_factory(
  File "/data2/derongxu/anaconda3/envs/py38/lib/python3.8/site-packages/datasets/load.py", line 1772, in dataset_module_factory
    return LocalDatasetModuleFactoryWithScript(
  File "/data2/derongxu/anaconda3/envs/py38/lib/python3.8/site-packages/datasets/load.py", line 960, in get_module
    _create_importable_file(
  File "/data2/derongxu/anaconda3/envs/py38/lib/python3.8/site-packages/datasets/load.py", line 466, in _create_importable_file
    importable_local_file = _copy_script_and_other_resources_in_importable_dir(
  File "/data2/derongxu/anaconda3/envs/py38/lib/python3.8/site-packages/datasets/load.py", line 388, in _copy_script_and_other_resources_in_importable_dir
    with FileLock(lock_path):
  File "/data2/derongxu/anaconda3/envs/py38/lib/python3.8/site-packages/filelock/_api.py", line 297, in __enter__
    self.acquire()
  File "/data2/derongxu/anaconda3/envs/py38/lib/python3.8/site-packages/filelock/_api.py", line 267, in acquire
    time.sleep(poll_interval)
KeyboardInterrupt
[2024-03-19 06:21:06,600] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 2351222
[2024-03-19 06:21:06,645] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 2351223
[2024-03-19 06:21:06,689] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 2351224
[2024-03-19 06:21:06,733] [INFO] [launch.py:325:sigkill_handler] Main process received SIGINT, exiting
Traceback (most recent call last):
  File "/data2/derongxu/anaconda3/envs/py38/bin/deepspeed", line 6, in <module>
    main()
  File "/data2/derongxu/anaconda3/envs/py38/lib/python3.8/site-packages/deepspeed/launcher/runner.py", line 584, in main
    result.wait()
  File "/data2/derongxu/anaconda3/envs/py38/lib/python3.8/subprocess.py", line 1083, in wait
    return self._wait(timeout=timeout)
  File "/data2/derongxu/anaconda3/envs/py38/lib/python3.8/subprocess.py", line 1822, in _wait
    (pid, sts) = self._try_wait(0)
  File "/data2/derongxu/anaconda3/envs/py38/lib/python3.8/subprocess.py", line 1780, in _try_wait
    (pid, sts) = os.waitpid(self.pid, wait_flags)
KeyboardInterrupt
I am still trying to replicate the error.
Based on my experience, the following may help: pass `trust_remote_code=True` to the `load_dataset` function, as we are using customized datasets. The specific code can be found in `src/gner_dataset.py`.
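A minimal sketch of that call, assuming the script path and config directory mentioned earlier in this thread (the exact keyword arguments in `run.py` may differ):

```python
# Sketch: load the customized GNER dataset script with remote code enabled.
# `data_dir` and `data_config_dir` follow the names used in this thread and
# are assumptions, not the verified run.py signature.
from datasets import load_dataset

raw_datasets = load_dataset(
    "src/gner_dataset.py",
    data_dir="data",
    data_config_dir="configs/dataset_configs/task_adaptation_configs",
    trust_remote_code=True,  # allow the custom builder code to run
)
```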
I also encountered a hang-up in the `load_dataset` function, but after waiting for a while (about five to ten minutes), it continued to load and run. I believe this is due to network issues.
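If the stall really is network-related, one thing that might be worth trying (an assumption on my part, not a verified fix) is forcing `datasets` into offline mode so it skips Hub lookups and relies only on local files and cache:

```python
# Assumption: set offline mode before `datasets` is imported so that no
# Hugging Face Hub requests are attempted during load_dataset().
import os

os.environ["HF_DATASETS_OFFLINE"] = "1"

import datasets  # imported after the env var so offline mode takes effect
```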
The output of the run is as follows:
/data/zzk/anaconda3/envs/NER/lib/python3.10/site-packages/datasets/load.py:922: FutureWarning: The repository for gner_dataset contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at /data/zzk/IE/GNER/src/gner_dataset.py
You can avoid this message in future by passing the argument trust_remote_code=True
.
Passing trust_remote_code=True
will be mandatory to load this dataset from the next major release of datasets
.
warnings.warn(
Using custom data configuration default-e60fd0a2766fdda3
03/19/2024 16:02:19 - INFO - datasets.builder - Using custom data configuration default-e60fd0a2766fdda3
Loading Dataset Infos from /data/zzk/.cache/huggingface/modules/datasets_modules/datasets/gner_dataset/0b99e6e3a59ef6dcb9af89de1cd95359e538dad5813fd3c6b8ef999024427d89
03/19/2024 16:02:19 - INFO - datasets.info - Loading Dataset Infos from /data/zzk/.cache/huggingface/modules/datasets_modules/datasets/gner_dataset/0b99e6e3a59ef6dcb9af89de1cd95359e538dad5813fd3c6b8ef999024427d89
Generating validation split: 1400 examples [00:00, 4708.74 examples/s]
Generating test split: 6470 examples [00:00, 7400.58 examples/s]
Found cached dataset gner_dataset (/data/zzk/.cache/huggingface/datasets/gner_dataset/default-e60fd0a2766fdda3/0.0.0/0b99e6e3a59ef6dcb9af89de1cd95359e538dad5813fd3c6b8ef999024427d89)
03/19/2024 16:10:42 - INFO - datasets.builder - Found cached dataset gner_dataset (/data/zzk/.cache/huggingface/datasets/gner_dataset/default-e60fd0a2766fdda3/0.0.0/0b99e6e3a59ef6dcb9af89de1cd95359e538dad5813fd3c6b8ef999024427d89)
Loading Dataset info from /data/zzk/.cache/huggingface/datasets/gner_dataset/default-e60fd0a2766fdda3/0.0.0/0b99e6e3a59ef6dcb9af89de1cd95359e538dad5813fd3c6b8ef999024427d89
03/19/2024 16:10:42 - INFO - datasets.info - Loading Dataset info from /data/zzk/.cache/huggingface/datasets/gner_dataset/default-e60fd0a2766fdda3/0.0.0/0b99e6e3a59ef6dcb9af89de1cd95359e538dad5813fd3c6b8ef999024427d89
Listing files in /data/zzk/.cache/huggingface/datasets/gner_dataset/default-e60fd0a2766fdda3/0.0.0/0b99e6e3a59ef6dcb9af89de1cd95359e538dad5813fd3c6b8ef999024427d89
03/19/2024 16:10:42 - INFO - datasets.arrow_dataset - Listing files in /data/zzk/.cache/huggingface/datasets/gner_dataset/default-e60fd0a2766fdda3/0.0.0/0b99e6e3a59ef6dcb9af89de1cd95359e538dad5813fd3c6b8ef999024427d89
Listing files in /data/zzk/.cache/huggingface/datasets/gner_dataset/default-e60fd0a2766fdda3/0.0.0/0b99e6e3a59ef6dcb9af89de1cd95359e538dad5813fd3c6b8ef999024427d89
03/19/2024 16:10:42 - INFO - datasets.arrow_dataset - Listing files in /data/zzk/.cache/huggingface/datasets/gner_dataset/default-e60fd0a2766fdda3/0.0.0/0b99e6e3a59ef6dcb9af89de1cd95359e538dad5813fd3c6b8ef999024427d89
[INFO|configuration_utils.py:727] 2024-03-19 16:10:42,960 >> loading configuration file output/flan-t5-base-task-adaptation/config.json
[INFO|configuration_utils.py:792] 2024-03-19 16:10:42,963 >> Model config T5Config {
"_name_or_path": "output/flan-t5-base-task-adaptation",
"architectures": [
"T5ForConditionalGeneration"
],
"classifier_dropout": 0.0,
"d_ff": 2048,
"d_kv": 64,
"d_model": 768,
"decoder_start_token_id": 0,
"dense_act_fn": "gelu_new",
"dropout_rate": 0.1,
"eos_token_id": 1,
"feed_forward_proj": "gated-gelu",
"initializer_factor": 1.0,
"is_encoder_decoder": true,
"is_gated_act": true,
"layer_norm_epsilon": 1e-06,
"model_type": "t5",
"n_positions": 512,
"num_decoder_layers": 12,
"num_heads": 12,
"num_layers": 12,
"output_past": true,
"pad_token_id": 0,
"relative_attention_max_distance": 128,
"relative_attention_num_buckets": 32,
"task_specific_params": {
"summarization": {
"early_stopping": true,
"length_penalty": 2.0,
"max_length": 200,
"min_length": 30,
"no_repeat_ngram_size": 3,
"num_beams": 4,
"prefix": "summarize: "
},
"translation_en_to_de": {
"early_stopping": true,
"max_length": 300,
"num_beams": 4,
"prefix": "translate English to German: "
},
"translation_en_to_fr": {
"early_stopping": true,
"max_length": 300,
"num_beams": 4,
"prefix": "translate English to French: "
},
"translation_en_to_ro": {
"early_stopping": true,
"max_length": 300,
"num_beams": 4,
"prefix": "translate English to Romanian: "
}
},
"tie_word_embeddings": false,
"torch_dtype": "float32",
"transformers_version": "4.37.1",
"use_cache": true,
"vocab_size": 32128
}
[INFO|tokenization_utils_base.py:2025] 2024-03-19 16:10:42,964 >> loading file spiece.model
[INFO|tokenization_utils_base.py:2025] 2024-03-19 16:10:42,964 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2025] 2024-03-19 16:10:42,964 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2025] 2024-03-19 16:10:42,964 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2025] 2024-03-19 16:10:42,964 >> loading file tokenizer_config.json
[INFO|modeling_utils.py:3475] 2024-03-19 16:10:43,009 >> loading weights file output/flan-t5-base-task-adaptation/model.safetensors
[INFO|configuration_utils.py:826] 2024-03-19 16:10:43,017 >> Generate config GenerationConfig {
  "decoder_start_token_id": 0,
  "eos_token_id": 1,
  "pad_token_id": 0
}
[INFO|modeling_utils.py:4352] 2024-03-19 16:10:43,580 >> All model checkpoint weights were used when initializing T5ForConditionalGeneration.
[INFO|modeling_utils.py:4360] 2024-03-19 16:10:43,580 >> All the weights of T5ForConditionalGeneration were initialized from the model checkpoint at output/flan-t5-base-task-adaptation.
If your task is similar to the task the model of the checkpoint was trained on, you can already use T5ForConditionalGeneration for predictions without further training.
[INFO|configuration_utils.py:779] 2024-03-19 16:10:43,583 >> loading configuration file output/flan-t5-base-task-adaptation/generation_config.json
[INFO|configuration_utils.py:826] 2024-03-19 16:10:43,583 >> Generate config GenerationConfig {
  "decoder_start_token_id": 0,
  "eos_token_id": 1,
  "pad_token_id": 0
}
Running tokenizer on prediction dataset:   0%| | 0/6470 [00:00<?, ? examples/s]
Caching processed dataset at /data/zzk/.cache/huggingface/datasets/gner_dataset/default-e60fd0a2766fdda3/0.0.0/0b99e6e3a59ef6dcb9af89de1cd95359e538dad5813fd3c6b8ef999024427d89/cache-ae0964abc6dfacfd.arrow
03/19/2024 16:10:43 - INFO - datasets.arrow_dataset - Caching processed dataset at /data/zzk/.cache/huggingface/datasets/gner_dataset/default-e60fd0a2766fdda3/0.0.0/0b99e6e3a59ef6dcb9af89de1cd95359e538dad5813fd3c6b8ef999024427d89/cache-ae0964abc6dfacfd.arrow
Running tokenizer on prediction dataset: 100%|█████████████| 6470/6470 [00:02<00:00, 2553.21 examples/s]
Running tokenizer on prediction dataset:  10%|███████▎ | 643/6470 [00:00<00:01, 3222.53 examples/s]
[INFO|trainer.py:571] 2024-03-19 16:10:47,087 >> Using auto half precision backend
03/19/2024 16:10:47 - INFO - main - Predict
[INFO|gner_trainer.py:60] 2024-03-19 16:10:47,092 >> Running Prediction
[INFO|gner_trainer.py:62] 2024-03-19 16:10:47,093 >> Num examples = 6470
[INFO|gner_trainer.py:65] 2024-03-19 16:10:47,093 >> Batch size = 4
Running tokenizer on prediction dataset: 100%|█████████████| 6470/6470 [00:02<00:00, 2460.44 examples/s]
 10%|████████████▏ | 78/809 [01:14<12:23, 1.02s/it]
Due to the low speed and instability of the customized dataset, I have made its use optional in commit https://github.com/yyDing1/GNER/commit/aed2fdf9298aae23b6185f769282d58fb090f93a. You simply need to add `--no_load_gner_customized_datasets` to the script and, as an alternative, specify a separate JSON file via `test_json_dir`.
In this file, you should format your input into the form of instructions (originally done in the customized dataset). An example of the processed file has been added, i.e. `data/zero-shot-test.jsonl`.
We also provide a corresponding script, `scripts/eval_t5_task_adaptation_json_input.sh`, which you can run.
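For reference, loading such a pre-instructed JSONL file reduces to the built-in `json` builder, so no custom script (and no script file lock) is involved; this mirrors the call quoted later in the thread:

```python
# Load the pre-formatted instruction file directly with the json builder.
from datasets import load_dataset

test_dataset = load_dataset(
    "json",
    data_files="data/zero-shot-test.jsonl",  # the example file mentioned above
    split="train",
)
```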
Looking forward to your feedback. @quqxui @XQZZK
Thanks very much for your efforts to help me.
I found that the program cannot run on the A6000 machine (CUDA 12.2); `scripts/eval_t5_task_adaptation_json_input.sh` doesn't work either. But it can run on the V100 machine (CUDA 11.7), even though they have the same conda env.
However, the V100 doesn't support bf16 and tf32, so performance decreases by 0.5.
I'm not sure if some feature of the A6000 is causing this problem. If you have any ideas, please let me know.
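(For anyone comparing the two machines: bf16 and tf32 require an Ampere-class GPU, i.e. compute capability >= 8.0; the A6000 qualifies and the V100 does not. A quick check:)

```python
# Report the GPU's identity and bf16 support; requires a CUDA build of torch.
import torch

print(torch.cuda.get_device_name(0))
print(torch.cuda.get_device_capability(0))  # (7, 0) on V100, (8, 6) on A6000
print(torch.cuda.is_bf16_supported())       # False on V100, True on A6000
```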
Thanks again for your help.
Finally!! I figured out the problem: file locking in `datasets` doesn't work on NFS file systems, as described in https://github.com/huggingface/datasets/issues/6744. The file system on my A6000 server is NFS.
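To confirm the filesystem type of a given path, something like the following works on Linux (assumes GNU coreutils `stat`):

```python
# Print the filesystem type of a path; "nfs" indicates the case where the
# default hard FileLock misbehaves.
import subprocess


def fs_type(path: str) -> str:
    out = subprocess.run(
        ["stat", "-f", "-c", "%T", path],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()


print(fs_type("."))
```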
A solution is to add the following code to `run.py`, as described in https://github.com/huggingface/datasets/issues/6395:
# Patch filelock BEFORE importing datasets, so that datasets binds the
# NFS-safe SoftFileLock instead of the default (hard) FileLock.
import filelock
filelock.FileLock = filelock.SoftFileLock
import datasets  # must come after the patch to pick it up
Note that this solution only supports this loading method:
load_dataset("json", data_files=data_args.test_json_dir, split="train")
So it works when using `scripts/eval_t5_task_adaptation_json_input.sh`, but it does not work with the loading method used by `scripts/eval_t5_task_adaptation.sh`.
Thanks for your help 👍 👍 👍
Hi,
I get a problem when testing the t5-base model. After running
bash scripts/eval_t5_task_adaptation.sh
the program is stuck in the output below. Do you have any idea about this situation? Thanks very much!