Query Subsetting

mistermboy commented 3 years ago

Another approach to improving the performance of Scholia queries could be to extract a subset of entities for each query. For example, in the country authors query, the first subquery extracts all the authors for a given country:

 SELECT DISTINCT ?author WHERE {
    ?author wdt:P27 | wdt:P1416/wdt:P17 | wdt:P108/wdt:P17 wd:{{q}}.
  }

Then a further query is run over those authors:

 SELECT
    ?author 
    (COUNT(DISTINCT ?citing_work) AS ?number_of_citing_works)
    (SAMPLE(?organization_) AS ?organization)
    (SAMPLE(?work) AS ?example_work)
  WHERE {
    INCLUDE %authors
    ?work wdt:P50 ?author .
    OPTIONAL { ?citing_work wdt:P2860 ?work . }
    OPTIONAL {
      ?author wdt:P1416 | wdt:P108 ?organization_ .
      ?organization_ wdt:P17 wd:Q32
    }
  }
  GROUP BY ?author

Our aim is to extract a subset of these authors in order to decrease the overall query time.
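The `{{q}}` placeholder in the first subquery has to be replaced with a country identifier before the query can be sent. A minimal Python sketch of that step, assuming plain string substitution; the names `AUTHORS_TEMPLATE` and `render_query` are hypothetical and this is not necessarily how Scholia renders its templates:

```python
import re

# Hypothetical template: the authors subquery from this thread,
# with the {{q}} placeholder left in place.
AUTHORS_TEMPLATE = """SELECT DISTINCT ?author WHERE {
  ?author wdt:P27 | wdt:P1416/wdt:P17 | wdt:P108/wdt:P17 wd:{{q}} .
}"""

# Wikidata item identifiers are an uppercase Q followed by a positive integer.
QID_PATTERN = re.compile(r"Q[1-9][0-9]*")


def render_query(template: str, qid: str) -> str:
    """Replace every {{q}} placeholder with a validated Wikidata QID."""
    if not QID_PATTERN.fullmatch(qid):
        raise ValueError(f"not a Wikidata item identifier: {qid!r}")
    return template.replace("{{q}}", qid)


# Luxembourg is Q32, as used later in this thread.
luxembourg_query = render_query(AUTHORS_TEMPLATE, "Q32")
```

Validating the identifier first also rejects a lowercase `q32`, which would otherwise produce a query that silently matches nothing.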

mistermboy commented 3 years ago

For countries with a small number of authors, such as Luxembourg (Q32), we can use the wikidata-entity-extractor tool to extract the subset of authors for that country. The tool runs the query and then uses the Wikidata API to collect, concurrently, all the data associated with each entity in the result.
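The idea of collecting entity data concurrently can be sketched as follows. This is not the code of wikidata-entity-extractor; it is a small Python illustration that assumes the `wbgetentities` action of the Wikidata API, batches of up to 50 identifiers per request, and a thread pool. All function names and the User-Agent string are hypothetical.

```python
import json
import urllib.parse
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

API_URL = "https://www.wikidata.org/w/api.php"
BATCH_SIZE = 50  # assumed per-request id limit for wbgetentities


def batches(ids: List[str], size: int = BATCH_SIZE) -> List[List[str]]:
    """Split the id list into consecutive chunks of at most `size` ids."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]


def build_url(ids: List[str]) -> str:
    """Build a wbgetentities URL for one batch of ids."""
    params = {"action": "wbgetentities", "ids": "|".join(ids), "format": "json"}
    return API_URL + "?" + urllib.parse.urlencode(params)


def fetch_json(url: str) -> dict:
    """Download one URL and decode the JSON body."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "subset-sketch/0.1 (example)"}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def extract_entities(
    ids: List[str],
    fetch: Callable[[str], dict] = fetch_json,
    workers: int = 4,
) -> Dict[str, dict]:
    """Fetch all batches in parallel and merge their 'entities' maps."""
    urls = [build_url(batch) for batch in batches(ids)]
    entities: Dict[str, dict] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for payload in pool.map(fetch, urls):
            entities.update(payload.get("entities", {}))
    return entities
```

The `fetch` parameter is injectable so the batching and merging logic can be exercised without network access; the resulting entity documents would still need to be converted to RDF before loading them into a triple store.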

mistermboy commented 3 years ago

After loading the Luxembourg authors subset into a local Blazegraph instance, we ran the authors subquery against both the Wikidata endpoint and the local Blazegraph endpoint.

 SELECT DISTINCT ?author WHERE {
    ?author wdt:P27 | wdt:P1416/wdt:P17 | wdt:P108/wdt:P17 wd:Q32 .
  }

We obtained the following results (times in milliseconds, eight iterations per endpoint):

Endpoint	Country	Nº nodes	Mean	Longest Time	Shortest Time	Iteration 0	Iteration 1	Iteration 2	Iteration 3	Iteration 4	Iteration 5	Iteration 6	Iteration 7
Wikidata	Luxembourg	6004	888.5	1604	231	1310	254	1076	231	1332	1070	1604	231
Local Blazegraph	Luxembourg	6004	75.125	183	41	183	63	74	68	60	70	42	41

As you can see, the query is faster on the local Blazegraph endpoint: the mean time drops from 888.5 ms to about 75 ms. Because this query involves relatively few nodes, the absolute gain is under a second. For a country with many more authors, such as Spain or Germany, the improvement could be larger.
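The summary columns in the table above can be reproduced from the per-iteration timings. A small Python sketch: `time_runs` is a hypothetical harness for collecting such timings and is not the script used for these measurements, while the two lists are copied from the table.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_runs(run: Callable[[], object], iterations: int = 8) -> List[float]:
    """Call run() repeatedly and return each wall-clock duration in ms."""
    durations = []
    for _ in range(iterations):
        start = time.perf_counter()
        run()
        durations.append((time.perf_counter() - start) * 1000.0)
    return durations


def summarize(durations_ms: List[float]) -> Dict[str, float]:
    """Compute the Mean / Longest Time / Shortest Time columns."""
    return {
        "mean": statistics.mean(durations_ms),
        "longest": max(durations_ms),
        "shortest": min(durations_ms),
    }


# Iteration 0..7 timings copied from the table above (milliseconds).
wikidata_ms = [1310, 254, 1076, 231, 1332, 1070, 1604, 231]
blazegraph_ms = [183, 63, 74, 68, 60, 70, 42, 41]

wikidata = summarize(wikidata_ms)
blazegraph = summarize(blazegraph_ms)

# Ratio of mean times: how many times faster the local endpoint was.
speedup = wikidata["mean"] / blazegraph["mean"]
print(wikidata, blazegraph, round(speedup, 1))
```

With only eight iterations and a wide spread on the Wikidata side (231 ms to 1604 ms), the mean is sensitive to caching and server load, so the ratio is a rough indication rather than a precise figure.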

mistermboy commented 3 years ago

The next step was to run the full country authors query for Luxembourg from WDQS, with the authors subquery federated to our local Blazegraph instance:

WITH {
  SELECT DISTINCT ?author WHERE {
    SERVICE <http://156.35.82.22/bigdata/sparql> {
      ?author wdt:P27 | wdt:P1416/wdt:P17 | wdt:P108/wdt:P17 wd:Q32 .
    }
  }
} AS %authors

However, WDQS only federates to a limited list of approved endpoints, so this was not possible. We therefore went the other way round: we ran the full query on the local Blazegraph endpoint and federated the rest of the query to Wikidata:


SELECT
  ?number_of_citing_works
  ?author ?authorLabel
  ?organization ?organizationLabel
  ?example_work ?example_workLabel
WITH {
  SELECT DISTINCT ?author WHERE {
    ?author wdt:P27 | wdt:P1416/wdt:P17 | wdt:P108/wdt:P17 wd:Q32 .
  }
} AS %authors
WITH {
    SELECT
      ?author 
      (COUNT(DISTINCT ?citing_work) AS ?number_of_citing_works)
      (SAMPLE(?organization_) AS ?organization)
      (SAMPLE(?work) AS ?example_work)
    WHERE {
      INCLUDE %authors
      SERVICE <https://query.wikidata.org/sparql> {
        ?work wdt:P50 ?author .
        OPTIONAL { ?citing_work wdt:P2860 ?work . }
        OPTIONAL {
          ?author wdt:P1416 | wdt:P108 ?organization_ .
          ?organization_ wdt:P17 wd:Q32 .
        }
      }
    }
    GROUP BY ?author 
} AS %results
WHERE {
  INCLUDE %results
}
ORDER BY DESC(?number_of_citing_works)
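The query above is sent to the local Blazegraph endpoint, which resolves `%authors` locally and forwards the `SERVICE` block to Wikidata. A hedged Python sketch of submitting such a query over the SPARQL protocol: the endpoint URL is the one from the earlier `SERVICE` clause, and `build_request` and `run_select` are hypothetical helper names, not part of any library.

```python
import json
import urllib.parse
import urllib.request

# Endpoint taken from the SERVICE clause earlier in this thread;
# replace it with the address of your own instance.
LOCAL_ENDPOINT = "http://156.35.82.22/bigdata/sparql"


def build_request(endpoint: str, query: str) -> urllib.request.Request:
    """Prepare a SPARQL protocol POST with a form-encoded query."""
    body = urllib.parse.urlencode({"query": query}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,  # supplying data makes urllib issue a POST
        headers={
            "Accept": "application/sparql-results+json",
            "Content-Type": "application/x-www-form-urlencoded",
        },
    )


def run_select(endpoint: str, query: str, timeout: float = 60.0) -> list:
    """Send the query and return the list of result bindings."""
    request = build_request(endpoint, query)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["results"]["bindings"]
```

A generous timeout is assumed here because the federated part of the query adds at least one round trip to the remote endpoint.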

We compared the original query run on WDQS with this query run on the local Blazegraph instance. The results (in milliseconds) are as follows:

Endpoint	Country	Nº nodes	Mean	Longest Time	Shortest Time	Iteration 0	Iteration 1	Iteration 2	Iteration 3	Iteration 4	Iteration 5	Iteration 6	Iteration 7
Wikidata	Luxembourg	440	3203	5477	241	4294	4059	5477	3076	241	3420	2611	2446
Local Blazegraph (federated)	Luxembourg	440	25623	32878	22933	32878	26616	24266	22933	24885	24545	24624	24237

As you can see, in this case there is no improvement in performance. In fact, the federated query is about eight times slower on average than the original query (a mean of 25623 ms against 3203 ms).

weso / weso-scholia

Query Subsetting #10