webis-de / msmarco-llm-distillation


ir_datasets integration #1

Open heinrichreimer opened 1 month ago

heinrichreimer commented 1 month ago

Could be handy to have this dataset in ir_datasets.

fschlatt commented 1 month ago

Very much so! I think it would make the most sense as a sub-type of scored docs. I will put it on my to-do list, but I will not get to it within the next two weeks. Feel free to integrate this yourself :)

mam10eks commented 1 month ago

How best to integrate this is a good question, but I would also be a big fan of it!

mam10eks commented 1 month ago

I could make a first proposal, as I need to process it for some other project anyway, so I could do some first "hacking" and then we can improve upon this :)

fschlatt commented 1 month ago

Awesome! IMO the most fitting way would be to add it as scored_docs and set the score to the negative rank: https://github.com/allenai/ir_datasets/blob/930a4e076f21b623d1de713ec434686b2c2c292d/ir_datasets/formats/base.py#L27
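A minimal sketch of the negative-rank idea, for illustration only. It assumes an LLM ranking arrives as an ordered list of document IDs for one query. The `ScoredDoc` tuple is defined locally and is only meant to resemble a (query_id, doc_id, score) scored-docs record; `ranking_to_scored_docs` and the sample IDs are hypothetical and not taken from this repository or from ir_datasets.

```python
from typing import Iterable, Iterator, NamedTuple


class ScoredDoc(NamedTuple):
    """Local stand-in for a scored-docs record (assumed fields)."""
    query_id: str
    doc_id: str
    score: float


def ranking_to_scored_docs(
    query_id: str, ranked_doc_ids: Iterable[str]
) -> Iterator[ScoredDoc]:
    """Yield one ScoredDoc per ranked document, with score = -rank.

    Ranks start at 1, so the top document gets score -1.0 and
    sorting by descending score reproduces the original ranking.
    """
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        yield ScoredDoc(query_id, doc_id, float(-rank))


# Hypothetical ranking for a single query, best document first.
scored = list(ranking_to_scored_docs("q1", ["d7", "d3", "d9"]))

# Sorting by score (highest first) recovers the ranking order.
recovered = [d.doc_id for d in sorted(scored, key=lambda d: d.score, reverse=True)]
```

Using the negative rank keeps the usual "higher score is better" convention, so tools that sort scored docs by descending score would see the documents in the order the LLM ranked them.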

mam10eks commented 1 month ago

I added a first version: https://github.com/webis-de/msmarco-llm-distillation/blob/main/data/ir_datasets_scored_docs.py

heinrichreimer commented 1 month ago

Very nice!