ExarcaFidalgo closed this issue 3 years ago
Testing with a LIMIT of 5000, we can get a sample of authors for almost every country. The only exception I have found so far is the United States, which is interesting, since its query does work in the WDQS (though it takes a long time).
The organizations query doesn't seem to work for these countries either...
A smaller LIMIT does provide results for the United States. It seems that, to guarantee this query works reliably, the maximum size of the subsetting should be 2000-2500.
So, in a new branch of the fork, I created a second version of the sparqlToDataTable function in scholia.js: sparqlToDataTableLimits.
The idea was to implement sequential partial querying and test it with _countryauthors. The concept is simple: go through the whole subsetting (in this case, all people related to the given country) in small batches of 1000. When one of these queries returns results, we add them to the data table and increase the offset for the next query by 1000.
```sparql
SELECT DISTINCT ?author WHERE {
  ?author wdt:P27 | wdt:P1416/wdt:P17 | wdt:P108/wdt:P17 wd:{{ q }} .
}
LIMIT 1000
OFFSET {w}
```
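The batching loop described above can be sketched roughly as follows. This is a minimal sketch, not the actual sparqlToDataTableLimits code; the function names (`buildBatchQuery`, `fetchBatches`, `runQuery`) and the row shape are assumptions for illustration.

```javascript
// Batch size used for the sequential partial querying.
const BATCH_SIZE = 1000;

// Build one batch of the author query, parameterized by offset.
// (Hypothetical helper; mirrors the query pattern shown above.)
function buildBatchQuery(q, offset) {
  return `SELECT DISTINCT ?author WHERE {
  ?author wdt:P27 | wdt:P1416/wdt:P17 | wdt:P108/wdt:P17 wd:${q} .
}
LIMIT ${BATCH_SIZE}
OFFSET ${offset}`;
}

// Walk through the whole subsetting in batches: each time a query
// returns results, append them and advance the offset by BATCH_SIZE.
// `runQuery` stands in for the actual SPARQL call and resolves to an
// array of result rows.
async function fetchBatches(q, runQuery) {
  const rows = [];
  for (let offset = 0; ; offset += BATCH_SIZE) {
    const batch = await runQuery(buildBatchQuery(q, offset));
    if (batch.length === 0) break; // subsetting exhausted
    rows.push(...batch); // in Scholia this would update the data table
  }
  return rows;
}
```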
The following (positive, may I say) observations were obtained from this experiment:
There are some problems, though.
Some examples. Spain (37K results):
Germany (in progress)
Alright then. I fixed the repetitions issue by registering the IDs of all extracted data and checking for repeated instances before displaying them.
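Repetitions across batches are to be expected here, since SPARQL makes no ordering guarantee for LIMIT/OFFSET slices without an ORDER BY, so consecutive batches can overlap. The fix can be sketched like this (a minimal sketch; the function name and the `author` field are assumptions, not the actual scholia.js code):

```javascript
// Register the IDs of all rows already extracted and filter out
// repeated instances before a new batch is displayed.
function filterRepetitions(batch, seenIds) {
  const fresh = [];
  for (const row of batch) {
    if (!seenIds.has(row.author)) {
      seenIds.add(row.author); // remember this ID for later batches
      fresh.push(row);
    }
  }
  return fresh; // only rows not shown in a previous batch
}
```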
In a heartbeat, I also put this new system to work with country_organizations. It adapted easily. So now every country shows a gradually built list of Authors and Organizations.
This looks like good progress!
Do you have an estimate of the additional strain we are putting on the SPARQL endpoint this way, due to the potentially large number of queries generated by a single Scholia page with multiple panels? Scholia already gets several tens of thousands of pageviews a day.
I would expect efficiency gains if this approach were to be combined with things like caching of past results (e.g. the top 1000 authors from the last run), subsetting (e.g. only authors with at least X citations), dumps (to decrease the workload of the SPARQL endpoint) or perhaps ASK or COUNT queries or a less comprehensive rendering of the results table (e.g. only the top 200 instead of potentially thousands).
Keep going in this direction!
I've opened a new issue to focus on improving this approach: #9.
Queries like country_authors cannot complete for countries with a large number of subjects (Spain, Germany, the US...). To get them working, I explored the possibility of using Named Solution Sets to reduce the time spent building the subsetting.
Nevertheless, when comparing times, I found that the response time of this subquery (see the following), while leaving margin for improvement, was quite small compared with that of the whole query: about 5 s, whereas the whole query takes at least 60 s, the Wikidata timeout.
Therefore, the time-consuming process had to be elsewhere. This subquery gets every person from or associated with the indicated country, and the outer query then checks the existence of publications (mainly) for all of them. Large countries have a notable number of people (Spain has 160K+), so I suspected that processing so many was causing the timeout.
So, if we try to set a maximum size for the subsetting:
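A sketch of what such a limited subsetting could look like, following the author pattern used elsewhere in this thread. This is a hypothetical reconstruction, not the exact query tested; the outer publication check and the LIMIT value of 2000 are assumptions.

```sparql
# Hypothetical sketch: cap the people subsetting with an inner LIMIT
# before the (mainly) publication-existence check runs on them.
SELECT DISTINCT ?author WHERE {
  {
    SELECT DISTINCT ?author WHERE {
      ?author wdt:P27 | wdt:P1416/wdt:P17 | wdt:P108/wdt:P17 wd:{{ q }} .
    }
    LIMIT 2000
  }
  ?work wdt:P50 ?author .
}
```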
It gets us over 600 results in ~3 seconds. This way, we get a nice sample of related researchers in a short time (open to improvement with NSS).
If we wish not only to show a number of them, but to list the most relevant ones, we'd need to go through the whole subsetting using a series of limited subqueries, moving our starting position with OFFSET. Ideally, given the way Scholia showcases retrieved data, we'd build the results gradually as the successive queries return them. Whether this idea is feasible is not yet clear to me.
For now, I'll try to get the simply limited version of this query working for those relevant countries in our Scholia.