Hi Niels,
Thank you for reaching out and for the support! I'm glad to hear that you discovered my work through the Hugging Face paper page.
I really appreciate the suggestion regarding adding model cards and relevant tags for visibility. I'll work on updating the model cards and will include the link to the paper as you recommended. Adding tags like "text-generation" definitely sounds helpful for making the work easier to find.
As for the dataset, I'll look into uploading it to the Hugging Face hub instead of relying on Google Drive. The dataset viewer guide you shared will be very useful.
I'll reach out if I need any further assistance. Thanks again for the guidance!
Best regards, Jintian Zhang
Hi Niels,
I'm sorry for the delayed response; other commitments have kept me from getting back to you sooner.
Our code supports loading the training dataset directly from Hugging Face. However, due to some errors, we are currently unable to use the load_dataset function; the details of the errors we encountered can be found here. Therefore, we are using hf_hub_download(repo_id=_hf_path['repo'], filename=_hf_path['name'], repo_type="dataset") to load the training dataset instead.
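For reference, the workaround looks roughly like this (a sketch; in our code the repo id and filename come from _hf_path, with the SelfRAG files filled in here as an example):

from huggingface_hub import hf_hub_download
import json

# download the raw training file from the Hub, bypassing load_dataset
filepath = hf_hub_download(
    repo_id="zjunlp/OneGen-TrainDataset-SelfRAG",  # example value for _hf_path['repo']
    filename="train.jsonl",                        # example value for _hf_path['name']
    repo_type="dataset",
)

# parse the JSON Lines file manually
with open(filepath) as f:
    train_data = [json.loads(line) for line in f]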
Thank you for your understanding and for your input regarding this issue.
Best regards, Jintian Zhang
Hi,
Thanks for pushing the commit and the explanation!
The reason datasets like https://huggingface.co/datasets/zjunlp/OneGen-TrainDataset-SelfRAG can't be loaded with the load_dataset function is that the data was uploaded as raw files, rather than with the Datasets library.
One could make the files compatible with Datasets by loading them from JSON and then calling push_to_hub, which would enable:

from datasets import load_dataset

dataset = load_dataset("zjunlp/OneGen-TrainDataset-SelfRAG")
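A minimal sketch of that conversion (assuming the raw train.jsonl file and write access to the dataset repo) could look like:

from datasets import load_dataset

# load the raw file with the generic "json" builder
dataset = load_dataset("json", data_files="train.jsonl")
# push the converted dataset back to the Hub so load_dataset works out of the box
dataset.push_to_hub("zjunlp/OneGen-TrainDataset-SelfRAG")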
Hi, we have tried the following code:
from datasets import load_dataset
dataset = load_dataset("json", data_files="./self_rag/train.jsonl")
But the error is the same:
Generating train split: 0 examples [00:00, ? examples/s]
Traceback (most recent call last):
File "/disk/disk_20T/zjt/anaconda3/lib/python3.10/site-packages/datasets/builder.py", line 1989, in _prepare_split_single
writer.write_table(table)
File "/disk/disk_20T/zjt/anaconda3/lib/python3.10/site-packages/datasets/arrow_writer.py", line 583, in write_table
pa_table = pa_table.combine_chunks()
File "pyarrow/table.pxi", line 3638, in pyarrow.lib.Table.combine_chunks
File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowIndexError: array slice would exceed array length
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/mikedean/upload/test.py", line 5, in <module>
dataset = load_dataset("json",data_files="./self_rag/train.jsonl")
File "/disk/disk_20T/zjt/anaconda3/lib/python3.10/site-packages/datasets/load.py", line 2582, in load_dataset
builder_instance.download_and_prepare(
File "/disk/disk_20T/zjt/anaconda3/lib/python3.10/site-packages/datasets/builder.py", line 1005, in download_and_prepare
self._download_and_prepare(
File "/disk/disk_20T/zjt/anaconda3/lib/python3.10/site-packages/datasets/builder.py", line 1100, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs)
File "/disk/disk_20T/zjt/anaconda3/lib/python3.10/site-packages/datasets/builder.py", line 1860, in _prepare_split
for job_id, done, content in self._prepare_split_single(
File "/disk/disk_20T/zjt/anaconda3/lib/python3.10/site-packages/datasets/builder.py", line 2016, in _prepare_split_single
raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset
Ok, this may be because it's JSON Lines rather than plain JSON. One solution here could be to do the following:
import pandas as pd
from huggingface_hub import hf_hub_download
from datasets import Dataset
# read JSON lines
filepath = hf_hub_download(repo_id="zjunlp/OneGen-TrainDataset-SelfRAG", filename="train.jsonl", repo_type="dataset")
df = pd.read_json(filepath, lines=True)
# convert to HF dataset
dataset = Dataset.from_pandas(df)
# push to hub
dataset.push_to_hub("your-hf-username/selfrag")
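After that push (using the placeholder repo name from the snippet above), loading the dataset directly should then work:

from datasets import load_dataset

dataset = load_dataset("your-hf-username/selfrag")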
Thank you! This is a great solution! However, I have another question: why does the file train.jsonl in the repository zjunlp/OneGen-TrainDataset-MultiHopQA not produce any errors?
Hi @MikeDean2367,
I think they just used the web interface to upload that file. It does not seem to be compatible with the Datasets library.
I see your paper does not have any linked datasets yet. Did you consider uploading it?
Hi @NielsRogge,
Thank you for your feedback! I will update the paper soon and add the link to the dataset. I appreciate your suggestion!
Hi @MikeDean2367,
Niels here from the open-source team at Hugging Face. I discovered your work through the paper page: https://huggingface.co/papers/2409.05152. I work together with AK on improving the visibility of researchers' work on the hub.
It's great to see the models available on the 🤗 hub! It would be great to add model cards, along with tags such as "text-generation", so that people can find your work when filtering https://huggingface.co/models. See more here: https://huggingface.co/docs/huggingface_hub/en/guides/model-cards.
The models can be linked to the paper page by adding https://huggingface.co/papers/2409.05152 in the model card.
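For example, a minimal sketch using the huggingface_hub library (the repo id below is a placeholder) could be:

from huggingface_hub import ModelCard

# model card with a "text-generation" tag and a link to the paper page
content = """---
tags:
- text-generation
---

# OneGen

Paper: https://huggingface.co/papers/2409.05152
"""

card = ModelCard(content)
card.push_to_hub("your-hf-username/your-model")  # placeholder repo id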
Uploading dataset
Would be awesome to make the training dataset available on 🤗, rather than Google Drive, so that people can do:
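For example, with the SelfRAG training set (any of the OneGen datasets would work the same way):

from datasets import load_dataset

dataset = load_dataset("zjunlp/OneGen-TrainDataset-SelfRAG")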
Besides that, there's the dataset viewer, which allows people to quickly explore the first few rows of the data in the browser. See here for a guide: https://huggingface.co/docs/datasets/loading.
Let me know if you're interested/need any help regarding this!
Cheers,
Niels
ML Engineer @ HF 🤗