Closed · diegodlh closed this 3 years ago
Very good point! I actually think it could use POST by default; I don't see any downside to that. Would you be interested in making a PR for this?
Thanks, @wetneb. I've just sent a PR using POST by default. The downside of not using GET, even for small requests, would be that, according to the documentation, POST queries are not cached (but now I see that maybe that just means not cached by the client, right?).
BTW, in case we ever want to use GET for short requests, I made some tests using the URL `https://query.wikidata.org/sparql?format=json&query=<query>` with different `<query>` lengths:

| `<query>` length | Response code |
| --- | --- |
| 8143 | 414 (URL too long) |
| 8142 | 431 (Header too long) |
| 7443 | 431 (Header too long) |
| 7442 | 200 (OK) |
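For reference, here is a rough sketch of how these length tests could be reproduced with the Python `requests` library. The padded dummy query and the tested lengths are just assumptions based on the table above, and URL encoding of the query may shift the exact limits a given client hits:

```python
import requests

ENDPOINT = "https://query.wikidata.org/sparql"

def status_for_query_length(length):
    # Pad a trivial SPARQL query with a comment until it reaches the target length.
    base = "SELECT * WHERE { ?s ?p ?o } LIMIT 1 #"
    query = base + "x" * max(0, length - len(base))
    # Note: requests URL-encodes the query, so the final URL is somewhat longer
    # than `length` plus the fixed prefix.
    resp = requests.get(ENDPOINT, params={"format": "json", "query": query})
    return resp.status_code

for length in (7442, 7443, 8142, 8143):
    print(length, status_for_query_length(length))
```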
Thank you so much for the PR!
The caching issue is a good point. We do cache some of the results in redis, but not all of them. So perhaps it is still worth deciding based on the query length?
Sure. I'm just not sure where to set the threshold.
The threshold to trigger a 414 error seems quite fixed: `<query>` length > 8142 (i.e., URL length > 8194).
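(For reference, that 8194 figure presumably comes from the 52-character fixed prefix `https://query.wikidata.org/sparql?format=json&query=` plus the 8142-character query: 52 + 8142 = 8194.)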
But I'm not sure about the threshold to trigger a 431 error (which is the lower threshold and hence the most relevant one). I'm not even sure why it's being triggered. For example, in the tests I mentioned above, which I ran in Firefox, the 431 error was triggered with a `<query>` length > 7442. But now I ran them again from the command line using `curl -G -d "format=json" -d "query=<query>" https://query.wikidata.org/sparql` and the error triggers with a `<query>` length > 7662. Any ideas?
Do you think we may be safe by setting a `<query> length > 7250` threshold?
I would intuitively keep a relatively low threshold to be on the safe side - the cost of not caching a borderline query is much lower than the cost of getting an error when performing that query.
Actually, disabling caching could be useful, because the requests we don't explicitly cache in redis are very cheap and should be as current as possible (fetching items by identifiers, for instance). So I think it should be fine to just do POST requests all the time.
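For illustration, here is a minimal sketch of what a POST-only call could look like with aiohttp. The function name, session handling, and headers are assumptions for this example, not the actual openrefine-wikibase code:

```python
import aiohttp

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

async def sparql_wikidata(query_string):
    # Always send the query in the request body, so its length can never
    # trigger the 414/431 errors discussed above.
    async with aiohttp.ClientSession() as session:
        async with session.post(
            WDQS_ENDPOINT,
            data={"query": query_string, "format": "json"},
            headers={"Accept": "application/sparql-results+json"},
        ) as resp:
            resp.raise_for_status()
            # content_type=None: the endpoint may reply with a mimetype other
            # than application/json, which aiohttp would otherwise reject.
            return await resp.json(content_type=None)
```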
I'm trying to run a large reconciliation batch like this:
With this batch, the openrefine-wikibase backend seems to be sending a SPARQL GET request (here in the source code). However, the URL query string is too large, and the SPARQL endpoint returns a "414 Request-URI Too Large" error. As a result, the openrefine-wikibase backend, which was waiting for a JSON response from the SPARQL endpoint, fails and returns an "Attempt to decode JSON with unexpected mimetype: text/html" error.
According to the Wikidata SPARQL endpoint documentation, "POST requests can alternatively accept the query in the body of the request, instead of the URL, which allows running larger queries without hitting URL length limits." However, these are not cached. Would it be possible for the `sparql_wikidata` method to check the length of the `query_string` parameter and decide to send a POST request instead if it is too long?

Edit: attaching an example reconciliation batch that runs into this. example.txt
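A minimal sketch of the length-based fallback suggested here, again with aiohttp: the 7250 threshold is only illustrative (taken from the discussion above), and the function and parameter names are assumptions rather than the actual implementation:

```python
import aiohttp

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"
MAX_GET_QUERY_LENGTH = 7250  # conservative, below the 414/431 limits observed above

async def sparql_wikidata(query_string):
    params = {"query": query_string, "format": "json"}
    async with aiohttp.ClientSession() as session:
        if len(query_string) <= MAX_GET_QUERY_LENGTH:
            # Short queries keep using GET so intermediate caches can help.
            request = session.get(WDQS_ENDPOINT, params=params)
        else:
            # Long queries go in the request body to avoid URL length limits.
            request = session.post(WDQS_ENDPOINT, data=params)
        async with request as resp:
            resp.raise_for_status()
            # content_type=None: skip aiohttp's strict application/json check.
            return await resp.json(content_type=None)
```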