wetneb / openrefine-wikibase

This repository has migrated to:
https://gitlab.com/nfdi4culture/ta1-data-enrichment/openrefine-wikibase

Long text query may incorrectly return empty results array #116

Open diegodlh opened 3 years ago

diegodlh commented 3 years ago

The reconciliation query's query field "is searched for with both search APIs provided by the Wikibase instance (the auto-complete API and the search API)".

The auto-complete API (wbsearchentities) "searches for entities using labels and aliases". Wikidata labels and aliases seem to be limited to 250 characters (I'm not sure what the limit is in other Wikibase instances). As a result, any query longer than 250 characters returns an empty results array from the Wikidata API's wbsearchentities (I've just posted a task in Phabricator suggesting that it return an error instead, since such a query can never match anything).
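For illustration, here is a minimal sketch (my own, using the `requests` library, not code from this project) of what happens when an over-long query hits that endpoint:

```python
import requests

# Any query longer than the 250-character label/alias limit
# cannot match any label or alias.
query = "x" * 300

resp = requests.get(
    "https://www.wikidata.org/w/api.php",
    params={
        "action": "wbsearchentities",
        "search": query,
        "language": "en",
        "format": "json",
    },
)

# The API reports no error; "search" is simply an empty list.
print(resp.json().get("search"))  # -> []
```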

On the other hand, the search API (query&list=search) searches page content (including labels and aliases, I understand). This endpoint has a query-length limit of 300 characters. In this case, the endpoint does return an error (instead of an empty results array) if the limit is exceeded, but openrefine-wikibase seems to ignore this error.
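The same kind of sketch against this endpoint shows the difference: the response carries an `error` object instead of results.

```python
import requests

# Longer than the 300-character limit of the search endpoint.
query = "x" * 350

resp = requests.get(
    "https://www.wikidata.org/w/api.php",
    params={
        "action": "query",
        "list": "search",
        "srsearch": query,
        "format": "json",
    },
)

data = resp.json()
# Unlike wbsearchentities, this endpoint reports the problem explicitly.
print(data.get("error"))
```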

As a result, reconciliation queries with a query field longer than 300 characters will always return an empty results array (as long as the query doesn't fall under one of the exceptions in the reconciliation workflow).

This may lead a user to believe that there is no item matching their query, when in reality the query failed with an error.

Would it make sense to either limit the length of the query field, or handle the error returned by the search API?
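Either option could look roughly like this (a hypothetical sketch, not the actual openrefine-wikibase code; the function name and constant are made up):

```python
import requests

MAX_SEARCH_LENGTH = 300  # documented limit of the query&list=search endpoint

def safe_search(query: str) -> list:
    """Hypothetical sketch: cap the query length and surface API errors
    instead of silently returning no results."""
    # Option 1: limit over-long queries up front (here by truncating).
    query = query[:MAX_SEARCH_LENGTH]
    resp = requests.get(
        "https://www.wikidata.org/w/api.php",
        params={"action": "query", "list": "search",
                "srsearch": query, "format": "json"},
    )
    data = resp.json()
    # Option 2: propagate the endpoint's error instead of ignoring it.
    if "error" in data:
        raise RuntimeError(data["error"].get("info", "search API error"))
    return data["query"]["search"]
```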

Why is the wbsearchentities endpoint used if the search API searches page content including labels and aliases? Assuming there is a reason (I'm sure there is), would this reason imply that the length of the query field should be further limited to 250 characters instead, or that the wbsearchentities error response proposed in my Phabricator ticket should be handled (if ever implemented)?

Thank you!

wetneb commented 3 years ago

I have just added some explanation of why we are using two endpoints here: https://openrefine-wikibase.readthedocs.io/en/latest/architecture.html#reconciliation (Taken from http://ceur-ws.org/Vol-2773/paper-17.pdf, which I have also linked in the readme).

Given OpenRefine's current behaviour, I am not sure about the benefit of returning an error rather than an empty list of results. In fact, the protocol does not really define a way to return an error for a single query in a batch (perhaps that's something worth adding?).

diegodlh commented 3 years ago

Hi, Antonin! Thank you for your reply. Sorry I couldn't check it sooner.

Thank you for the example you added to the documentation explaining why both endpoints are used. It is very clear! And thanks for sharing your paper too!

I understand you are not sure about the benefit of returning an error rather than an empty list of results. I'm not sure either. Let me explain the situation that gave rise to this suggestion, in case it helps clear things up.

Cita is a Wikidata add-on for the reference management software Zotero that adds support for citation metadata (i.e., which sources a given source cites). It gets this information from Wikidata, where citing and cited items are linked via P2860 "cites work" statements.

However, to get citations, the QID of the citing item must be known. This is where the Wikidata reconciliation service enters the scene. Cita sends a reconciliation query including unique identifiers (DOI and ISBN, if available) in the properties field, and the item's title in the query field.
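For concreteness, such a query looks roughly like this (a sketch of my own; I'm using the public endpoint of the Wikidata reconciliation service, and the DOI/ISBN values are made up; P356 is DOI, P212 is ISBN-13):

```python
import json
import requests

# Public endpoint of the Wikidata reconciliation service (English).
API = "https://wikidata.reconci.link/en/api"

queries = {
    "q0": {
        "query": "The full (possibly very long) title of the work",
        "properties": [
            {"pid": "P356", "v": "10.1000/example"},    # DOI (made-up value)
            {"pid": "P212", "v": "978-3-16-148410-0"},  # ISBN-13 (made-up value)
        ],
    }
}

resp = requests.post(API, data={"queries": json.dumps(queries)})
# With a query field longer than 300 characters, "result" comes back
# empty even when a matching item exists.
print(resp.json()["q0"]["result"])
```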

A user is working with old books that have long titles, such as Q106923254, whose full title is 348 characters long. Because of Wikidata's 250-character limit on labels and aliases, the item's label is a shortened version of the title, while the full title appears in a P1476 "title" statement.

A Cita user (or any Wikidata reconciliation service consumer) may be tempted to submit a query for the full title. First, the wbsearchentities endpoint would return an empty array (this is what I posted a Phabricator ticket about, because I think it should return an error instead). After your example explaining why both endpoints are used, I agree it may be OK that this is just ignored by the reconciliation API (after all, it's somewhat similar to the "Lovelace, Ada" example).

Now, the query&list=search endpoint would return an error, because the maximum query length is 300 characters. The reconciliation API currently seems to ignore this error.

As a result, the reconciliation service would return an empty array, and the user might conclude that no item for that book exists in Wikidata and create a duplicate. Had the user instead queried a shortened version of the title within the 300-character limit, wbsearchentities would still have returned an empty array, but the query&list=search endpoint would have found the string in the page content and returned Q106923254. (Actually, query&list=search doesn't seem to search P1476 statements, but I think that's a bug.)

Sorry for the long message; this issue may well be irrelevant to other users of the reconciliation API, so feel free to close it if you think that's the case. It just occurred to me when I ran into this in Cita that it might be relevant, but again, I'm not sure. To work around this in Cita, I may just refuse to reconcile items with titles longer than 250 characters and ask the user to provide an alternative short title, as sketched below.
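Something like this hypothetical sketch (the function name is made up):

```python
MAX_LABEL_LENGTH = 250  # Wikidata's limit on labels and aliases

def title_for_reconciliation(title: str) -> str:
    """Hypothetical client-side guard: reject titles that cannot match any
    label or alias, so the user can supply a short title instead."""
    if len(title) > MAX_LABEL_LENGTH:
        raise ValueError(
            f"Title is longer than {MAX_LABEL_LENGTH} characters; "
            "please provide an alternative short title."
        )
    return title
```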

Thank you for your useful project and for taking the time to read all this!

wetneb commented 3 years ago

Yes, you are obviously right: just because OpenRefine doesn't do error handling properly doesn't mean we should prevent other clients from doing so…

So I think we really need an error-handling mechanism in the protocol for that. I have opened an issue for it here: https://github.com/reconciliation-api/specs/issues/69. If you have ideas about what syntax we should use for it, feel free to chime in there :)

diegodlh commented 3 years ago

Thank you for taking care of this, @wetneb! I'm already following the issue you opened :)