zjunlp / OneGen

[EMNLP 2024 Findings] OneGen: Efficient One-Pass Unified Generation and Retrieval for LLMs.
MIT License

Improve HF integration #4

Closed: NielsRogge closed this issue 1 month ago

NielsRogge commented 2 months ago

Hi @MikeDean2367,

Niels here from the open-source team at Hugging Face. I discovered your work through the paper page: https://huggingface.co/papers/2409.05152. I work together with AK on improving the visibility of researchers' work on the hub.

It's great to see the models available on the 🤗 hub. It would be great to add model cards, along with tags, so that people can find them when filtering https://huggingface.co/models. We can add tags like "text-generation" so that people will find your work. See more here: https://huggingface.co/docs/huggingface_hub/en/guides/model-cards.

The models can be linked to the paper page by adding https://huggingface.co/papers/2409.05152 in the model card.
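For reference, here's a rough sketch (untested, the repo id below is a placeholder) of how the card metadata, tags, and paper link could also be pushed programmatically with huggingface_hub:

from huggingface_hub import ModelCard, ModelCardData

# Metadata that ends up in the YAML header of the model card
card_data = ModelCardData(
    license="mit",
    tags=["text-generation"],
)

# Minimal card body; the paper URL is what links the model to the paper page
content = f"""---
{card_data.to_yaml()}
---

# OneGen

Model from the paper: https://huggingface.co/papers/2409.05152
"""

ModelCard(content).push_to_hub("your-hf-org/your-model")  # placeholder repo id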

Uploading dataset

Would be awesome to make the training dataset available on 🤗, rather than Google Drive, so that people can do:

from datasets import load_dataset

dataset = load_dataset("your-hf-org/your-dataset")

Besides that, there's the dataset viewer, which allows people to quickly explore the first few rows of the data in the browser. See here for a guide: https://huggingface.co/docs/datasets/loading.

Let me know if you're interested/need any help regarding this!

Cheers,

Niels
ML Engineer @ HF 🤗

MikeDean2367 commented 1 month ago

Hi Niels,

Thank you for reaching out and for the support! I'm glad to hear that you discovered my work through the Hugging Face paper page.

I really appreciate the suggestion regarding adding model cards and relevant tags for visibility. I'll work on updating the model cards and will include the link to the paper as you recommended. Adding tags like "text-generation" definitely sounds helpful for making the work easier to find.

As for the dataset, I'll look into uploading it to the Hugging Face hub instead of relying on Google Drive. The dataset viewer guide you shared will be very useful.

I'll reach out if I need any further assistance. Thanks again for the guidance!

Best regards,
Jintian Zhang

MikeDean2367 commented 1 month ago

Hi Niels,

I'm sorry for the delayed response; other commitments have kept me from getting back to you sooner.

Our code supports loading the training dataset directly from Hugging Face. However, due to some errors, we are currently unable to use the load_dataset function; the details of the errors we encountered can be found here. Therefore, we use hf_hub_download(repo_id=_hf_path['repo'], filename=_hf_path['name'], repo_type="dataset") to load the training dataset instead.
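For illustration, here is a simplified sketch of this workaround (using the SelfRAG repo as an example; in our code the actual repo id and filename come from _hf_path):

import json
from huggingface_hub import hf_hub_download

# Download the raw JSONL file from the dataset repo on the hub
filepath = hf_hub_download(
    repo_id="zjunlp/OneGen-TrainDataset-SelfRAG",
    filename="train.jsonl",
    repo_type="dataset",
)

# Parse the JSON Lines file manually instead of going through load_dataset
with open(filepath, "r", encoding="utf-8") as f:
    examples = [json.loads(line) for line in f if line.strip()]

print(f"Loaded {len(examples)} training examples")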

Thank you for your understanding and for your input regarding this issue.

Best regards,
Jintian Zhang

NielsRogge commented 1 month ago

Hi,

Thanks for pushing the commit and the explanation!

The reason datasets like https://huggingface.co/datasets/zjunlp/OneGen-TrainDataset-SelfRAG can't be loaded with the load_dataset function is that the data seems to have been uploaded just as raw files, rather than with the Datasets library.

One could make the files compatible with Datasets by loading them from JSON and then calling push_to_hub, which would enable:

from datasets import load_dataset

dataset = load_dataset("OneGen-TrainDataset-SelfRAG")
MikeDean2367 commented 1 month ago

Hi, we have tried the following code:

from datasets import load_dataset
dataset = load_dataset("json", data_files="./self_rag/train.jsonl")

But the error is the same:

Generating train split: 0 examples [00:00, ? examples/s]
Traceback (most recent call last):
  File "/disk/disk_20T/zjt/anaconda3/lib/python3.10/site-packages/datasets/builder.py", line 1989, in _prepare_split_single
    writer.write_table(table)
  File "/disk/disk_20T/zjt/anaconda3/lib/python3.10/site-packages/datasets/arrow_writer.py", line 583, in write_table
    pa_table = pa_table.combine_chunks()
  File "pyarrow/table.pxi", line 3638, in pyarrow.lib.Table.combine_chunks
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowIndexError: array slice would exceed array length

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/mikedean/upload/test.py", line 5, in <module>
    dataset = load_dataset("json",data_files="./self_rag/train.jsonl")
  File "/disk/disk_20T/zjt/anaconda3/lib/python3.10/site-packages/datasets/load.py", line 2582, in load_dataset
    builder_instance.download_and_prepare(
  File "/disk/disk_20T/zjt/anaconda3/lib/python3.10/site-packages/datasets/builder.py", line 1005, in download_and_prepare
    self._download_and_prepare(
  File "/disk/disk_20T/zjt/anaconda3/lib/python3.10/site-packages/datasets/builder.py", line 1100, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/disk/disk_20T/zjt/anaconda3/lib/python3.10/site-packages/datasets/builder.py", line 1860, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "/disk/disk_20T/zjt/anaconda3/lib/python3.10/site-packages/datasets/builder.py", line 2016, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset

NielsRogge commented 1 month ago

OK, this may be because it's JSON Lines instead of plain JSON. One solution here could be the following:

import pandas as pd
from huggingface_hub import hf_hub_download
from datasets import Dataset

# read JSON lines
filepath = hf_hub_download(repo_id="zjunlp/OneGen-TrainDataset-SelfRAG", filename="train.jsonl", repo_type="dataset")
df = pd.read_json(filepath, lines=True)

# convert to HF dataset
dataset = Dataset.from_pandas(df)

# push to hub
dataset.push_to_hub("your-hf-username/selfrag")
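Once pushed, the dataset should then load directly with load_dataset (the repo id is the placeholder from the snippet above) and show up in the dataset viewer:

from datasets import load_dataset

# push_to_hub writes a single "train" split by default
dataset = load_dataset("your-hf-username/selfrag", split="train")
print(dataset[0])
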
MikeDean2367 commented 1 month ago

Thank you! This is a great solution! However, I have another question: why does loading the file train.jsonl from the repository zjunlp/OneGen-TrainDataset-MultiHopQA not produce any errors?

NielsRogge commented 1 month ago

Hi @MikeDean2367,

I think they just used the web interface to upload that file. It does not seem to be compatible with the Datasets library.

I see your paper does not have any linked datasets yet; did you consider uploading them?

MikeDean2367 commented 1 month ago

Hi @NielsRogge,

Thank you for your feedback! I will update the paper soon and add the link to the dataset. I appreciate your suggestion!