yuyijiong / hard_retrieval_for_llm

hard long context retrieval tasks for language models

Make data available on HF #1

Open NielsRogge opened 6 days ago

NielsRogge commented 6 days ago

Hi @yuyijiong,

Niels here from the open-source team at Hugging Face. I discovered your work through AK's daily papers: https://huggingface.co/papers/2410.04422 (feel free to claim it with your HF account). I work together with AK on improving the visibility of researchers' work on the hub.

It'd be great to make the dataset available on the 🤗 hub. We can add tags so that people can find it when filtering https://huggingface.co/datasets. Pushing is as easy as:

import pandas as pd
from datasets import Dataset

# read the JSON Lines file
filepath = "...jsonl"
df = pd.read_json(filepath, lines=True)

# convert to a HF dataset
dataset = Dataset.from_pandas(df)

# push to the hub (requires being logged in, e.g. via `huggingface-cli login`)
dataset.push_to_hub("your-hf-username/your-dataset")
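Before pushing, it may help to sanity-check the JSON Lines file. The following is only a minimal sketch using the standard library: `check_jsonl` is a hypothetical helper name, and the rule that every row must share the same keys is an assumption made for illustration (pandas itself would fill missing fields rather than fail).

```python
import json
from pathlib import Path


def check_jsonl(path):
    """Return (row_count, sorted_keys) for a JSON Lines file.

    Hypothetical helper: raises ValueError if a line is not a JSON object
    or if its keys differ from those of the first row (an assumed rule).
    """
    keys = None
    count = 0
    with Path(path).open(encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                row = json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"line {lineno}: invalid JSON ({exc.msg})") from exc
            if not isinstance(row, dict):
                raise ValueError(f"line {lineno}: expected a JSON object")
            if keys is None:
                keys = set(row)  # the first row defines the expected fields
            elif set(row) != keys:
                raise ValueError(f"line {lineno}: keys differ from the first row")
            count += 1
    return count, sorted(keys or [])
```

Calling `check_jsonl(filepath)` on the same path used above would report the number of rows and the field names, or point at the first offending line.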

There's also the dataset viewer, which lets people see the first few rows in the browser: https://huggingface.co/docs/hub/en/datasets-viewer.

This would make the dataset more accessible and easier to discover. We can then also link the dataset to the paper page.

Let me know if you're interested/need any help.

Kind regards,

Niels

yuyijiong commented 5 days ago

Thank you. I have uploaded it to https://huggingface.co/datasets/yuyijiong/difficult_retrieval