wetneb / openrefine-wikibase

This repository has migrated to:
https://gitlab.com/nfdi4culture/ta1-data-enrichment/openrefine-wikibase

SPARQL endpoint URL length limits make large reconciliation batches fail #109

Closed: diegodlh closed this issue 3 years ago

diegodlh commented 3 years ago

I'm trying to run a large reconciliation batch like this:

queries = {
  "q24": {
    "query": "Some title",
    "type": "Q386724",
    "type_strict": "should",
    "properties": [
      {
        "pid": "P356",
        "v": "10.1016/J.PSE.2015.08.003"
      }
    ]
  },
  ...
}

To resolve it, the openrefine-wikibase backend seems to send a SPARQL GET request (here in the source code). However, the URL query string is too long, and the SPARQL endpoint returns a "414 Request-URI Too Large" error. The openrefine-wikibase backend, which was expecting a JSON response from the SPARQL endpoint, then fails with an "Attempt to decode JSON with unexpected mimetype: text/html" error.

According to the Wikidata SPARQL endpoint documentation, "POST requests can alternatively accept the query in the body of the request, instead of the URL, which allows running larger queries without hitting URL length limits." However, such POST requests are not cached. Would it be possible for the sparql_wikidata method to check the length of the query_string parameter and send a POST request instead if it is too long?

Edit: attaching example reconciliation batch that runs into this. example.txt
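
For illustration, here is a rough sketch of what that check could look like, assuming the backend talks to the endpoint through an aiohttp session (the mimetype error above is aiohttp's). The function signature, endpoint constant and threshold are illustrative assumptions, not the actual sparql_wikidata implementation.

import aiohttp

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"  # assumed endpoint URL
MAX_GET_QUERY_LENGTH = 7250  # illustrative cut-off; see the length tests further down

async def sparql_wikidata(session: aiohttp.ClientSession, query_string: str) -> dict:
    params = {"query": query_string, "format": "json"}
    if len(query_string) <= MAX_GET_QUERY_LENGTH:
        # Short queries keep using GET so intermediate HTTP caches can serve them.
        async with session.get(WDQS_ENDPOINT, params=params) as resp:
            return await resp.json()
    # Long queries go in the request body, avoiding the 414/431 URL limits.
    async with session.post(WDQS_ENDPOINT, data=params) as resp:
        return await resp.json()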

wetneb commented 3 years ago

Very good point! I actually think it could use POST by default; I don't see any downside to that. Would you be interested in making a PR for this?

diegodlh commented 3 years ago

Thanks, @wetneb. I've just sent a PR using POST by default. The downside of never using GET, not even for small requests, is that, according to the documentation, POST queries are not cached (but now I see that maybe that just means not cached by the client, right?).

BTW, in case we ever want to use GET for short requests, I ran some tests against the URL https://query.wikidata.org/sparql?format=json&query=<query> with different <query> lengths:

<query> length    response code
8143              414 (URL too long)
8142              431 (Header too long)
7443              431 (Header too long)
7442              200 (OK)
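
In case anyone wants to repeat the measurement, here is a rough probe script, assuming the requests library. The padding scheme and the exact thresholds depend on how the client percent-encodes the query and on which headers it sends, so the numbers will not necessarily match the table above.

import requests

ENDPOINT = "https://query.wikidata.org/sparql"
# A descriptive User-Agent is expected by the Wikimedia User-Agent policy.
HEADERS = {"User-Agent": "url-length-probe/0.1 (example contact)"}

def probe(query_length: int) -> int:
    # Pad a trivial ASK query with a trailing comment up to the requested length.
    query = "ASK{}#".ljust(query_length, "x")
    resp = requests.get(ENDPOINT, params={"format": "json", "query": query}, headers=HEADERS)
    return resp.status_code

for n in (7442, 7443, 8142, 8143):
    print(n, probe(n))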

wetneb commented 3 years ago

Thank you so much for the PR!

The caching issue is a good point. We do cache some of the results in redis, but not all of them. So perhaps it is still worth deciding based on the query length?

diegodlh commented 3 years ago

Sure. I'm just not sure where to set the threshold.

The threshold to trigger a 414 error seems quite fixed: <query> length > 8142 (i.e., URL length > 8194, since https://query.wikidata.org/sparql?format=json&query= is 52 characters).

But I'm not sure about the threshold to trigger a 431 error (which is the lower of the two and hence the more relevant one). I'm not even sure why it's being triggered. For example, in the tests I mentioned above, which I ran in Firefox, the 431 error was triggered with a <query> length > 7442. But when I ran them again from the command line using curl -G -d "format=json" -d "query=<query>" https://query.wikidata.org/sparql, the error only triggered with <query> length > 7662. Any ideas?

Do you think we would be safe setting a "<query> length > 7250" threshold?

wetneb commented 3 years ago

I would intuitively keep a relatively low threshold to be on the safe side: the cost of not caching a borderline query is much lower than the cost of getting an error when performing it.

wetneb commented 3 years ago

Actually, disabling caching could even be useful, because the requests we don't explicitly cache in redis are very cheap and should be as current as possible (fetching items by identifiers, for instance). So I think it should be fine to just do POST requests all the time.
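
For reference, a minimal sketch of that POST-only variant, again assuming an aiohttp session; names and error handling are illustrative rather than the code actually merged in the PR.

import aiohttp

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"  # assumed endpoint URL

async def sparql_wikidata(session: aiohttp.ClientSession, query_string: str) -> dict:
    # The query travels in the request body, so URL length limits never apply.
    async with session.post(
        WDQS_ENDPOINT,
        data={"query": query_string, "format": "json"},
    ) as resp:
        resp.raise_for_status()
        return await resp.json()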