Timeout in large SPARQL queries

ExarcaFidalgo commented 3 years ago

Queries like country_authors time out for countries with a large number of subjects (Spain, Germany, US...). To get them working, I explored using Named Solution Sets (NSS) to reduce the time needed to build the subset.

However, when comparing times I found that, although there is room for improvement, the response time for that subquery (shown below) is small compared with the whole query: about 5 s, whereas Wikidata times out after 60 s.

WITH {  
SELECT DISTINCT ?author WHERE {    
     ?author wdt:P27 | wdt:P1416/wdt:P17 | wdt:P108/wdt:P17 wd:{{ q }} .  
}} AS %authors

Therefore, the time-consuming step had to be elsewhere. The query takes every person from or associated with the given country and checks (mainly) whether publications exist for each of them. Large countries have a lot of people (Spain has 160K+), so I suspected that processing so many was causing the timeout.

So, if we set a maximum size for the subset:

WITH {  
SELECT DISTINCT ?author WHERE {    
     ?author wdt:P27 | wdt:P1416/wdt:P17 | wdt:P108/wdt:P17 wd:{{ q }} .  
}
LIMIT 5000
} AS %authors

This returns over 600 results in ~3 seconds. This way, we get a decent sample of related researchers in a short time (with room for improvement using NSS).

If we want not just to show some of them but to list the most relevant ones, we would need to go through the whole subset using a series of limited subqueries, moving the starting position with OFFSET. Ideally, given the way Scholia displays the retrieved data, we would build up the results gradually as successive queries return them. I do not yet know whether this is feasible.

For now, I'll try to get the simple LIMIT version of the query working for the relevant countries in our Scholia.

ExarcaFidalgo commented 3 years ago

With a LIMIT of 5000, we can get a sample of authors for almost every country.

The only exception I have found so far is the United States. This is interesting, since its query does work in WDQS (although it takes a long time).

Organizations doesn't seem to work for these countries either...

A smaller LIMIT does provide results for the United States. It seems that, to guarantee that this query works, the maximum size of the subset should be 2000-2500.

ExarcaFidalgo commented 3 years ago

So, in a new branch of the fork, I created a second version of the sparqlToDataTable function in scholia.js: sparqlToDataTableLimits.

The idea was to implement sequential partial querying and test it for country_authors. It is a simple concept: go through the whole subset (in this case, all people related to the country) in small batches of 1000. When one of these queries returns a set of results, we add them to the datatable and increase the offset for the next query by 1000.

SELECT DISTINCT ?author WHERE {    
?author wdt:P27 | wdt:P1416/wdt:P17 | wdt:P108/wdt:P17 wd:{{ q }} .  
}  
LIMIT 1000  
OFFSET {w}
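
The batching loop described above could be sketched roughly as follows. This is not the actual sparqlToDataTableLimits code: `buildBatchQuery`, `fetchAllBatches`, and the injected `runQuery` and `onBatch` callbacks are hypothetical names, and the stop condition (a batch shorter than the limit) is an assumption about how the end of the subset is detected.

```javascript
// Hypothetical sketch of sequential batched querying; not the actual
// sparqlToDataTableLimits implementation.

const BATCH_SIZE = 1000;

// Build the batch query for a country QID, mirroring the template above
// with {{ q }} and {w} filled in.
function buildBatchQuery(countryQid, offset, limit = BATCH_SIZE) {
  return [
    "SELECT DISTINCT ?author WHERE {",
    `  ?author wdt:P27 | wdt:P1416/wdt:P17 | wdt:P108/wdt:P17 wd:${countryQid} .`,
    "}",
    `LIMIT ${limit}`,
    `OFFSET ${offset}`,
  ].join("\n");
}

// Run the batches one after another.
// `runQuery` (assumption): async function that sends a SPARQL string to the
// endpoint and resolves to an array of result rows.
// `onBatch` (assumption): callback that receives each non-empty batch,
// e.g. to append the rows to the table.
async function fetchAllBatches(countryQid, runQuery, onBatch, limit = BATCH_SIZE) {
  let offset = 0;
  let total = 0;
  while (true) {
    const rows = await runQuery(buildBatchQuery(countryQid, offset, limit));
    if (rows.length > 0) {
      onBatch(rows);
      total += rows.length;
    }
    // Assumed stop condition: a batch shorter than the limit means the
    // subset is exhausted.
    if (rows.length < limit) break;
    offset += limit;
  }
  return total;
}
```

Note that SPARQL does not guarantee a stable result order across queries without ORDER BY, so paging with LIMIT and OFFSET alone may return overlapping or missing rows between batches.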

The following (positive, may I say) observations are obtained from this experiment:

The queries that timed out because of the size of the subset (Spain, Germany, US) now return results.
Not only do they return results, but the results appear rather quickly, given the small size of the batches.
Eventually, they can return all the subjects of interest in the subset.

There are some problems, though.

Even though it seems to return all possible authors, it takes a while to get there in the largest cases. For those, it usually returns 100-500 authors every 2 seconds, for a total of 40000+. That is, although some results are available from the very beginning and the list grows quickly, there may still be several minutes of waiting if we want to find a particular subject.
Some rows are repeated, because they are returned by more than one query. Some kind of filtering would be needed.

Some examples: Spain (37K results) and Germany (in progress).

ExarcaFidalgo commented 3 years ago

Alright then. I fixed the repetition issue by recording the IDs of all extracted rows and checking for repeated instances before displaying them.
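
A minimal sketch of that kind of ID-based filtering, assuming each row exposes its entity ID under an `author` key. The function names and the row shape are hypothetical, not taken from the branch.

```javascript
// Hypothetical sketch: keep only rows whose ID has not been displayed yet.
// `idOf` extracts the entity ID from a row; the default assumes rows look
// like { author: "Q123", ... }.
function createDeduplicator(idOf = (row) => row.author) {
  const seenIds = new Set(); // IDs of every row already passed on for display
  return function filterNew(rows) {
    const fresh = [];
    for (const row of rows) {
      const id = idOf(row);
      if (!seenIds.has(id)) {
        seenIds.add(id);
        fresh.push(row);
      }
    }
    return fresh;
  };
}
```

Each incoming batch would be passed through `filterNew` before being added to the table, so duplicates within a batch and across batches are both dropped.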

I also quickly put this new system to work with country_organizations. It adapts easily. So now, every country shows a gradually loading list of Authors and Organizations.

Daniel-Mietchen commented 3 years ago

This looks like good progress!

Do you have an estimate of the (additional) strain we are putting on the SPARQL endpoint this way, given the potentially large number of queries generated by a single Scholia page with multiple panels? Scholia already gets several tens of thousands of pageviews a day.

I would expect efficiency gains if this approach were to be combined with things like caching of past results (e.g. the top 1000 authors from the last run), subsetting (e.g. only authors with at least X citations), dumps (to decrease the workload of the SPARQL endpoint) or perhaps ASK or COUNT queries or a less comprehensive rendering of the results table (e.g. only the top 200 instead of potentially thousands).
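
As an illustration of the caching idea only, here is a minimal in-memory sketch. `createCachedRunner`, the injected `runQuery` function, the `ttlMs` lifetime, and the `now` clock are all hypothetical, and where such a cache would actually live (browser, server, or a proxy in front of the endpoint) is left open.

```javascript
// Hypothetical sketch: cache query results keyed by the query string, so a
// repeated query within the time-to-live does not hit the endpoint again.
// `runQuery` (assumption): async function resolving to an array of rows.
// `ttlMs` and `now` are illustrative parameters, not existing settings.
function createCachedRunner(runQuery, ttlMs = 60 * 60 * 1000, now = () => Date.now()) {
  const cache = new Map(); // query string -> { rows, storedAt }
  return async function cachedRun(query) {
    const hit = cache.get(query);
    if (hit && now() - hit.storedAt < ttlMs) {
      return hit.rows; // still fresh: reuse the stored rows
    }
    const rows = await runQuery(query);
    cache.set(query, { rows, storedAt: now() });
    return rows;
  };
}
```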

Keep going in this direction!

ExarcaFidalgo commented 3 years ago

I've opened a new issue to focus on improving this approach: #9.

weso / weso-scholia

Timeout in large SPARQL queries #8