yale-nlp / DocMath-Eval

Data and Code for the paper "DocMath-Eval: Evaluating Math Reasoning Capabilities of LLMs in Understanding Long and Specialized Documents"

DatasetGenerationError: cannot download the testmini data #2

Closed acDante closed 1 month ago

acDante commented 1 month ago

Hi, I am trying to download your dataset from Hugging Face with your example code, but I got the following error while the script was downloading the complong_testmini set. Do you know how to resolve this issue?

---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
File ~/miniconda3/envs/test/lib/python3.10/site-packages/datasets/builder.py:1869, in ArrowBasedBuilder._prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, job_id)
   1868 try:
-> 1869     writer.write_table(table)
   1870 except CastError as cast_error:

File ~/miniconda3/envs/test/lib/python3.10/site-packages/datasets/arrow_writer.py:580, in ArrowWriter.write_table(self, pa_table, writer_batch_size)
    579 pa_table = pa_table.combine_chunks()
--> 580 pa_table = table_cast(pa_table, self._schema)
    581 if self.embed_local_files:

File ~/miniconda3/envs/test/lib/python3.10/site-packages/datasets/table.py:2283, in table_cast(table, schema)
   2282 if table.schema != schema:
-> 2283     return cast_table_to_schema(table, schema)
   2284 elif table.schema.metadata != schema.metadata:

File ~/miniconda3/envs/test/lib/python3.10/site-packages/datasets/table.py:2242, in cast_table_to_schema(table, schema)
   2237     raise CastError(
   2238         f"Couldn't cast\n{_short_str(table.schema)}\nto\n{_short_str(features)}\nbecause column names don't match",
   2239         table_column_names=table.column_names,
   2240         requested_column_names=list(features),
   2241     )
-> 2242 arrays = [cast_array_to_feature(table[name], feature) for name, feature in features.items()]
   2243 return pa.Table.from_arrays(arrays, schema=schema)

File ~/miniconda3/envs/test/lib/python3.10/site-packages/datasets/table.py:2242, in <listcomp>(.0)
   2237     raise CastError(
   2238         f"Couldn't cast\n{_short_str(table.schema)}\nto\n{_short_str(features)}\nbecause column names don't match",
   2239         table_column_names=table.column_names,
   2240         requested_column_names=list(features),
   2241     )
-> 2242 arrays = [cast_array_to_feature(table[name], feature) for name, feature in features.items()]
   2243 return pa.Table.from_arrays(arrays, schema=schema)

File ~/miniconda3/envs/test/lib/python3.10/site-packages/datasets/table.py:1795, in _wrap_for_chunked_arrays.<locals>.wrapper(array, *args, **kwargs)
   1794 if isinstance(array, pa.ChunkedArray):
-> 1795     return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
   1796 else:

File ~/miniconda3/envs/test/lib/python3.10/site-packages/datasets/table.py:1795, in <listcomp>(.0)
   1794 if isinstance(array, pa.ChunkedArray):
-> 1795     return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
   1796 else:

File ~/miniconda3/envs/test/lib/python3.10/site-packages/datasets/table.py:2100, in cast_array_to_feature(array, feature, allow_primitive_to_str, allow_decimal_to_str)
   2099 elif not isinstance(feature, (Sequence, dict, list, tuple)):
-> 2100     return array_cast(
   2101         array,
   2102         feature(),
   2103         allow_primitive_to_str=allow_primitive_to_str,
   2104         allow_decimal_to_str=allow_decimal_to_str,
   2105     )
   2106 raise TypeError(f"Couldn't cast array of type\n{_short_str(array.type)}\nto\n{_short_str(feature)}")

File ~/miniconda3/envs/test/lib/python3.10/site-packages/datasets/table.py:1797, in _wrap_for_chunked_arrays.<locals>.wrapper(array, *args, **kwargs)
   1796 else:
-> 1797     return func(array, *args, **kwargs)

File ~/miniconda3/envs/test/lib/python3.10/site-packages/datasets/table.py:1949, in array_cast(array, pa_type, allow_primitive_to_str, allow_decimal_to_str)
   1948         raise TypeError(f"Couldn't cast array of type {_short_str(array.type)} to {_short_str(pa_type)}")
-> 1949     return array.cast(pa_type)
   1950 raise TypeError(f"Couldn't cast array of type {_short_str(array.type)} to {_short_str(pa_type)}")

File ~/miniconda3/envs/test/lib/python3.10/site-packages/pyarrow/array.pxi:1000, in pyarrow.lib.Array.cast()

File ~/miniconda3/envs/test/lib/python3.10/site-packages/pyarrow/compute.py:405, in cast(arr, target_type, safe, options, memory_pool)
    404         options = CastOptions.safe(target_type)
--> 405 return call_function("cast", [arr], options, memory_pool)

File ~/miniconda3/envs/test/lib/python3.10/site-packages/pyarrow/_compute.pyx:590, in pyarrow._compute.call_function()

File ~/miniconda3/envs/test/lib/python3.10/site-packages/pyarrow/_compute.pyx:385, in pyarrow._compute.Function.call()

File ~/miniconda3/envs/test/lib/python3.10/site-packages/pyarrow/error.pxi:155, in pyarrow.lib.pyarrow_internal_check_status()

File ~/miniconda3/envs/test/lib/python3.10/site-packages/pyarrow/error.pxi:92, in pyarrow.lib.check_status()

ArrowInvalid: Float value 0.0193237 was truncated converting to int64

The above exception was the direct cause of the following exception:

DatasetGenerationError                    Traceback (most recent call last)
Cell In[20], line 5
      1 from datasets import load_dataset
      4 # data = load_dataset('openai/gsm8k', 'main')
----> 5 dataset = load_dataset("yale-nlp/DocMath-Eval", "default")
      7 # print(dataset["complong-testmini"][0])

File ~/miniconda3/envs/test/lib/python3.10/site-packages/datasets/load.py:2096, in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, verification_mode, keep_in_memory, save_infos, revision, token, streaming, num_proc, storage_options, trust_remote_code, **config_kwargs)
   2093     return builder_instance.as_streaming_dataset(split=split)
   2095 # Download and prepare data
-> 2096 builder_instance.download_and_prepare(
   2097     download_config=download_config,
   2098     download_mode=download_mode,
   2099     verification_mode=verification_mode,
   2100     num_proc=num_proc,
   2101     storage_options=storage_options,
   2102 )
   2104 # Build dataset for splits
   2105 keep_in_memory = (
   2106     keep_in_memory if keep_in_memory is not None else is_small_dataset(builder_instance.info.dataset_size)
   2107 )

File ~/miniconda3/envs/test/lib/python3.10/site-packages/datasets/builder.py:924, in DatasetBuilder.download_and_prepare(self, output_dir, download_config, download_mode, verification_mode, dl_manager, base_path, file_format, max_shard_size, num_proc, storage_options, **download_and_prepare_kwargs)
    922 if num_proc is not None:
    923     prepare_split_kwargs["num_proc"] = num_proc
--> 924 self._download_and_prepare(
    925     dl_manager=dl_manager,
    926     verification_mode=verification_mode,
    927     **prepare_split_kwargs,
    928     **download_and_prepare_kwargs,
    929 )
    930 # Sync info
    931 self.info.dataset_size = sum(split.num_bytes for split in self.info.splits.values())

File ~/miniconda3/envs/test/lib/python3.10/site-packages/datasets/builder.py:999, in DatasetBuilder._download_and_prepare(self, dl_manager, verification_mode, **prepare_split_kwargs)
    995 split_dict.add(split_generator.split_info)
    997 try:
    998     # Prepare split will record examples associated to the split
--> 999     self._prepare_split(split_generator, **prepare_split_kwargs)
   1000 except OSError as e:
   1001     raise OSError(
   1002         "Cannot find data file. "
   1003         + (self.manual_download_instructions or "")
   1004         + "\nOriginal error:\n"
   1005         + str(e)
   1006     ) from None

File ~/miniconda3/envs/test/lib/python3.10/site-packages/datasets/builder.py:1740, in ArrowBasedBuilder._prepare_split(self, split_generator, file_format, num_proc, max_shard_size)
   1738 job_id = 0
   1739 with pbar:
-> 1740     for job_id, done, content in self._prepare_split_single(
   1741         gen_kwargs=gen_kwargs, job_id=job_id, **_prepare_split_args
   1742     ):
   1743         if done:
   1744             result = content

File ~/miniconda3/envs/test/lib/python3.10/site-packages/datasets/builder.py:1896, in ArrowBasedBuilder._prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, job_id)
   1894     if isinstance(e, DatasetGenerationError):
   1895         raise
-> 1896     raise DatasetGenerationError("An error occurred while generating the dataset") from e
   1898 yield job_id, True, (total_num_examples, total_num_bytes, writer._features, num_shards, shard_lengths)

DatasetGenerationError: An error occurred while generating the dataset

yilunzhao commented 1 month ago

Hi @acDante, thanks for the information! We have fixed the error on the Hugging Face side. Please try again and see if it has been resolved.
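The thread does not say what was changed on the Hugging Face side. The message "Float value 0.0193237 was truncated converting to int64" is, however, consistent with a numeric field that holds integers in some records and floats in others, so that a schema inferred as int64 later meets a float. As a rough sketch under that assumption, the check below scans already-parsed records for such fields; the function name and the `ground_truth` field in the sample data are hypothetical, not taken from the dataset.

```python
from typing import Any, Dict, Iterable, Set


def find_mixed_numeric_fields(records: Iterable[Dict[str, Any]]) -> Dict[str, Set[str]]:
    """Map each field that holds both int and float values to the type names seen."""
    seen: Dict[str, Set[str]] = {}
    for record in records:
        for key, value in record.items():
            # bool is a subclass of int in Python, so exclude it explicitly.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                continue
            seen.setdefault(key, set()).add(type(value).__name__)
    # Keep only the fields where more than one numeric type was observed.
    return {key: kinds for key, kinds in seen.items() if len(kinds) > 1}


# Hypothetical records; the field names and values are for illustration only.
sample_records = [
    {"question_id": "a", "ground_truth": 3},
    {"question_id": "b", "ground_truth": 0.0193237},
]
mixed = find_mixed_numeric_fields(sample_records)
```

If a field shows up in the result, storing all of its values as floats before upload is one way to keep the inferred column type consistent.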

acDante commented 1 month ago

Hi @yilunzhao, thanks very much for your prompt reply! It works now! There is a minor error in the example code: the key name should be "complong_testmini" instead of "complong-testmini". I am also wondering how you processed the texts and tables in the documents. Do you store each table separately as a single string in the paragraphs, or may a single string include both table and text?
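For anyone hitting the same key mismatch, here is a small sketch of a lookup that treats hyphens and underscores as equivalent. The helper name `resolve_split_name` is made up for illustration, and the split names in the demo other than "complong_testmini" are placeholders.

```python
from typing import Iterable


def resolve_split_name(requested: str, available: Iterable[str]) -> str:
    """Return the available split name matching `requested`, treating '-' and '_' alike."""
    names = list(available)

    def normalize(name: str) -> str:
        return name.replace("-", "_").lower()

    target = normalize(requested)
    for name in names:
        if normalize(name) == target:
            return name
    raise KeyError(f"No split matching {requested!r}; available: {sorted(names)}")


# Placeholder split names for the demo; only "complong_testmini" comes from the thread.
demo_splits = ["placeholder_split", "complong_testmini"]
resolved = resolve_split_name("complong-testmini", demo_splits)
```

With the call from the traceback, dataset = load_dataset("yale-nlp/DocMath-Eval", "default"), the first record could then be read as dataset[resolve_split_name("complong-testmini", dataset.keys())][0].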

yilunzhao commented 1 month ago

Thanks for your question! For the short-context settings (i.e., SimpShort and CompShort), the paragraphs contain a single string that includes both text and tables. For the long-context settings (i.e., SimpLong and CompLong), we separate the text and table strings in the paragraphs.
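To make the two layouts concrete, here is a toy sketch of how the paragraphs of a record might look in each setting and how they could be joined into one context string. The record contents, the pipe-delimited table formatting, and the `build_context` helper are all assumptions for illustration; only the `paragraphs` name and the combined-versus-separate distinction come from the reply above.

```python
from typing import List


def build_context(paragraphs: List[str], separator: str = "\n\n") -> str:
    """Join paragraph strings into one context string, skipping empty entries."""
    return separator.join(p.strip() for p in paragraphs if p.strip())


# Invented table text; the real dataset may format tables differently.
toy_table = "| Year | Revenue |\n| 2019 | 10 |\n| 2020 | 12 |"

# Short-context layout: one string that mixes text and table.
short_example = {"paragraphs": ["Revenue rose in 2020.\n" + toy_table]}

# Long-context layout: text and table stored as separate strings.
long_example = {"paragraphs": ["Revenue rose in 2020.", toy_table]}

short_context = build_context(short_example["paragraphs"])
long_context = build_context(long_example["paragraphs"])
```

Either layout ends up as a single string here, so downstream prompting code would not need to branch on the setting.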