webis-de / archive-query-log

📜 The Archive Query Log.
https://tira.io/task/archive-query-log
MIT License

Kaggle dataset #49

Closed DiTo97 closed 1 month ago

DiTo97 commented 1 month ago

I extracted the manually curated search results as a Kaggle dataset citing @heinrichreimer as the author; is that okay?

heinrichreimer commented 1 month ago

Of course, that's fine, as you are giving attribution to the GitHub repository. It would be very kind of you to also cite my co-authors: Sebastian Schmidt, Maik Fröbe, Lukas Gienapp, Harrisen Scells, Benno Stein, Matthias Hagen, and Martin Potthast. And the paper DOI would be: https://doi.org/10.1145/3539618.3591890

Thanks for putting the data on Kaggle! (By the way, clever idea to extract the examples from the unit tests 😄 Did you know that we also have some more example data in the repo? https://github.com/webis-de/archive-query-log/tree/main/data/examples)

DiTo97 commented 1 month ago

@heinrichreimer, added the other authors!

AFAIK, I cannot assign the existing DOI to the dataset (Kaggle only offers to generate a new DOI specific to the Kaggle dataset), so I just added it as a link to the paper in the dataset description.

didn't know you had more examples in the repository! at first glance, they seem to use a different data format/schema than the manual annotations; I might add those to the dataset as well if standardizing the schema is straightforward

heinrichreimer commented 1 month ago

Alright, thanks for the clarification about the DOI 👍 The conversion should be straightforward, as our examples also use a very similar JSON format. I think it would also be cool to add the link to the Kaggle dataset to the GitHub README!
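
For example, a rough sketch of what that conversion could look like. The key names url_query and serp_query here are just my guess at how the example fields map onto the manual annotations' query and interpreted_query, so double-check them against the actual files:

```python
import json

# Hypothetical key mapping from the examples' schema to the
# manual annotations' schema; the exact key names are an
# assumption and may differ in the real files.
KEY_MAP = {
    "url_query": "query",
    "serp_query": "interpreted_query",
}

def convert_record(record: dict) -> dict:
    """Rename mapped keys, keep all other fields unchanged."""
    return {KEY_MAP.get(key, key): value for key, value in record.items()}

with open("serps.jsonl") as src, open("converted.jsonl", "w") as dst:
    for line in src:
        dst.write(json.dumps(convert_record(json.loads(line))) + "\n")
```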

DiTo97 commented 1 month ago

sounds great, feel free to add the link to the dataset in the README!

I will expand the dataset using the additional examples, likely keeping the manual annotations' format.

FYI, the reason I put the dataset together is that at scrapegraph AI we are developing the deep search graph, and I was looking for annotated SERPs to evaluate the link re-ranker node, which is arguably its most critical node.

I felt I could share it as a dataset since I found it quite useful; happy you liked the idea!

DiTo97 commented 1 month ago

@heinrichreimer, exploring the additional examples, it seems they are not quite in line with the spirit of the dataset.

IIUC, serps.jsonl contains search queries with "url query" and "serp query" in place of "query" and "interpreted query", plus lots of additional information, and a separation between "url" and "wayback url" that I have yet to understand.

even if that were fine, results.jsonl contains a single search result per line, with its rank, rather than the full set of search results per query with the corresponding ranks, which is critical for training a semantic re-ranking model.
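
for reference, here is roughly how I would group the per-result lines back into per-query result lists; the field names serp_url and rank below are guesses from a first skim, not the confirmed schema:

```python
import json
from collections import defaultdict

# group the one-result-per-line records back into one list per SERP;
# "serp_url" and "rank" are guessed field names, not confirmed
results_per_serp = defaultdict(list)
with open("results.jsonl") as f:
    for line in f:
        record = json.loads(line)
        results_per_serp[record["serp_url"]].append(record)

# sort each SERP's results by rank to restore the original ordering
for results in results_per_serp.values():
    results.sort(key=lambda record: record["rank"])
```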

heinrichreimer commented 1 month ago

> IIUC, serps.jsonl contains search queries with "url query" and "serp query" in place of "query" and "interpreted query", plus lots of additional information, and a separation between "url" and "wayback url" that I have yet to understand.

Yes, correct: the url is the "original" URL of the website, and the wayback_raw_url, for example, points to the same page as archived by the Internet Archive. So you could parse the SERP's HTML from there if you want to extract more data.
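
For example, a minimal sketch of fetching one archived SERP for further parsing (standard library only; wayback_raw_url is the field described above, and you'd swap in your HTML parser of choice):

```python
import json
from urllib.request import urlopen

# read the first SERP record and fetch its archived HTML snapshot;
# "wayback_raw_url" is the field name mentioned above
with open("serps.jsonl") as f:
    serp = json.loads(f.readline())

with urlopen(serp["wayback_raw_url"]) as response:
    html = response.read().decode("utf-8", errors="replace")

print(html[:200])  # raw SERP markup, ready for your parser of choice
```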

If you care only for SERPs that have results, it's relatively easy to filter:

```bash
grep '"results": \[{' serps.jsonl > /path/to/filtered.jsonl
```
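
If the grep heuristic feels too brittle (it depends on the exact key order and whitespace of the serialization), a short sketch that actually parses each line as JSON would be safer:

```python
import json

# keep only SERP records whose "results" list is non-empty;
# parsing the JSON avoids depending on key order or whitespace
with open("serps.jsonl") as src, open("filtered.jsonl", "w") as dst:
    for line in src:
        if json.loads(line).get("results"):
            dst.write(line)
```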

> even if that were fine, results.jsonl contains a single search result per line, with its rank, rather than the full set of search results per query with the corresponding ranks, which is critical for training a semantic re-ranking model.

Yes, results.jsonl is just a "flipped" version of serps.jsonl, where we have one JSON line per search result instead of one JSON line per SERP.