weso / weso-scholia

Collaboration between Scholia and WESO
GNU General Public License v2.0

Using a Cache: Intermediate Query Results vs Full Query Results #19

Open mistermboy opened 3 years ago

mistermboy commented 3 years ago

Another approach to removing timeouts and reducing query times could be to save the query results in a cache. At this point there are two possibilities: caching intermediate query results or caching full query results.

For example, for the country authors query we could save just the first query's results, where we get all the people from a country, and then perform the rest of the query; or we could save the result of the whole query, where we already get all the authors from a country.

For the first option, we would need to figure out a way of injecting the cached intermediate results into the query as a solution set, in order to execute the resulting query later. Maybe this could be useful -> https://github-wiki-see.page/m/blazegraph/database/wiki/SPARQL_Update. However, computing the intermediate results is not the most workload-intensive part compared to the rest of the query, so this might not greatly improve performance.
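One standard way to feed a cached solution set back into a query is an inline `VALUES` block. The sketch below is a hypothetical illustration, not Scholia's actual code: the query fragment, variable name, and QIDs are all assumptions, but the injection mechanism is plain SPARQL 1.1.

```javascript
// Hypothetical sketch: prepend cached intermediate results (Wikidata QIDs,
// e.g. all people from a country) to the remainder of a query as a SPARQL
// VALUES block, so the endpoint only evaluates the cheaper second half.
function injectSolutionSet(queryTail, variable, qids) {
  const values = qids.map((q) => `wd:${q}`).join(" ");
  return `VALUES ?${variable} { ${values} }\n${queryTail}`;
}

// Illustrative tail pattern and QIDs (not from the actual country authors query).
const tail = "?person wdt:P106 wd:Q1650915 .";
const query = injectSolutionSet(tail, "person", ["Q42", "Q80"]);
// query now starts with: VALUES ?person { wd:Q42 wd:Q80 }
```

The resulting string would then be wrapped in the remaining `SELECT ... WHERE { ... }` and sent to the endpoint as usual; the `VALUES` block plays the role of the injected solution set.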

For the other option, we would save just the final results of the query. This may look like it requires storing a lot of information, but it is actually less than for the intermediate results: for the country authors query with Luxembourg as the country, we get 6026 humans for the intermediate query vs. 452 humans for the full query.

One of the issues we may face with saving the final results concerns the map views and similar. If we take this approach we are no longer sending queries to Wikidata, so we cannot expect the results rendered as a map or a graph. However, there is a way of drawing all these views from the query results using the Wikidata Query Service dist (more docs about the result views).
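For the result views to work on cached data, the cached rows would presumably need to be shaped back into the standard SPARQL 1.1 JSON results format (`application/sparql-results+json`), which is what the Query Service view code consumes. A minimal sketch, assuming the cache stores rows as flat objects of variable name to URI (the row shape and URI-only typing are assumptions):

```javascript
// Hypothetical sketch: wrap cached rows in the W3C SPARQL 1.1 Query Results
// JSON shape ({ head: { vars }, results: { bindings } }) so a result-view
// renderer can consume them as if they came from the endpoint.
function toSparqlJson(rows) {
  const vars = rows.length ? Object.keys(rows[0]) : [];
  return {
    head: { vars },
    results: {
      bindings: rows.map((row) =>
        Object.fromEntries(
          // Assumes every cached value is an entity URI; literals would
          // need type "literal" (plus datatype/lang) instead.
          vars.map((v) => [v, { type: "uri", value: row[v] }])
        )
      ),
    },
  };
}

const cached = [{ author: "http://www.wikidata.org/entity/Q42" }];
const json = toSparqlJson(cached);
// json.head.vars → ["author"]
```

A real cache would also have to preserve literal bindings (labels, coordinates for the map view) with their datatypes, so storing the endpoint's JSON response verbatim may be simpler than reconstructing it.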

ExarcaFidalgo commented 3 years ago

Created a NodeJS script which takes advantage of the partial querying to go through the configured queries and parameters and save the results in MongoDB as JSON. For each query, a collection is created in the database; for each parameter, a document with the obtained data.
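The collection-per-query, document-per-parameter layout described above can be sketched as follows. This is a hypothetical reconstruction, not the actual script: the collection naming, `_id` choice, and document fields are assumptions.

```javascript
// Hypothetical sketch of the cache layout: one MongoDB collection per
// configured query (e.g. "country_authors"), one document per parameter
// (e.g. the country QID), holding the rows obtained for that parameter.
function buildCacheDoc(queryName, parameter, bindings) {
  return {
    collection: queryName.toLowerCase(), // assumed naming convention
    doc: {
      _id: parameter,                    // parameter as document key, e.g. "Q32"
      fetchedAt: new Date().toISOString(),
      bindings,                          // raw result rows from the endpoint
    },
  };
}

// With the official `mongodb` driver, the write would look roughly like:
//   const { MongoClient } = require("mongodb");
//   const client = await MongoClient.connect(uri);
//   const db = client.db("scholia-cache"); // database name is an assumption
//   const { collection, doc } = buildCacheDoc("COUNTRY_AUTHORS", "Q32", rows);
//   await db.collection(collection).replaceOne({ _id: doc._id }, doc, { upsert: true });

const { collection, doc } = buildCacheDoc("COUNTRY_AUTHORS", "Q32", []);
```

Keying documents by parameter and upserting makes re-runs idempotent: refreshing the cache for one country overwrites only that country's document.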

A trial run was carried out with COUNTRY_AUTHORS, passing as parameters all members of the European Union and using the unordered subsetting version (therefore, with inconsistencies). The process took 5553 seconds, approximately an hour and a half, to complete.

The related documents in the MongoDB database occupy 81.04 MB. Note that the actual size would be somewhat larger, since we are losing about 2-5% of the nodes per query.
