weso / weso-scholia

Collaboration between Scholia and WESO
GNU General Public License v2.0

Query Subsetting #10

Open mistermboy opened 3 years ago

mistermboy commented 3 years ago

Another approach to improving the performance of Scholia queries could be to extract a subset of entities for each query. For example, in the country authors query, the first subquery extracts all the authors for a given country:

 SELECT DISTINCT ?author WHERE {
    ?author wdt:P27 | wdt:P1416/wdt:P17 | wdt:P108/wdt:P17 wd:{{q}}.
  }

Then, for each author, a further query is made:

 SELECT
    ?author 
    (COUNT(DISTINCT ?citing_work) AS ?number_of_citing_works)
    (SAMPLE(?organization_) AS ?organization)
    (SAMPLE(?work) AS ?example_work)
  WHERE {
    INCLUDE %authors
    ?work wdt:P50 ?author .
    OPTIONAL { ?citing_work wdt:P2860 ?work . }
    OPTIONAL {
      ?author wdt:P1416 | wdt:P108 ?organization_ .
      ?organization_ wdt:P17 wd:Q32
    }
  }
  GROUP BY ?author 

Our aim is to extract a subset of these authors in order to decrease the overall query time.

mistermboy commented 3 years ago

For countries with a small number of authors, such as Luxembourg (Q32), we can use the wikidata-entity-extractor tool to extract a subset of the country's authors. This tool uses the Wikidata API to concurrently collect all the data associated with each entity returned by the query.
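
The concurrent collection step can be pictured with the following minimal sketch. It is not the wikidata-entity-extractor's actual code: the function names, the worker count, and the User-Agent string are hypothetical, and it assumes the entities are read from Wikidata's `Special:EntityData` JSON export rather than whichever API call the tool really uses.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable
import json
import urllib.request

# Wikidata's per-entity JSON export (assumed data source for this sketch).
ENTITY_URL = "https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"


def entity_url(qid: str) -> str:
    """Return the Special:EntityData URL for one entity ID such as 'Q32'."""
    return ENTITY_URL.format(qid=qid)


def fetch_entity(qid: str) -> dict:
    """Download one entity document (performs a network request)."""
    request = urllib.request.Request(
        entity_url(qid),
        headers={"User-Agent": "subset-sketch/0.1 (example)"},  # placeholder
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def fetch_all(qids: Iterable[str],
              fetch: Callable[[str], dict] = fetch_entity,
              max_workers: int = 8) -> Dict[str, dict]:
    """Fetch every entity concurrently and return a {qid: document} mapping."""
    qids = list(qids)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each ID with its result.
        return dict(zip(qids, pool.map(fetch, qids)))
```

Passing the fetch function as a parameter keeps the concurrency logic separate from the network call, so the sketch can be exercised offline with a stub, for example `fetch_all(["Q32"], fetch=lambda qid: {"id": qid})`.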

mistermboy commented 3 years ago

After setting up a local Blazegraph instance loaded with the Luxembourg authors subset, we ran the authors subquery against both the Wikidata and the local Blazegraph endpoints.

 SELECT DISTINCT ?author WHERE {
    ?author wdt:P27 | wdt:P1416/wdt:P17 | wdt:P108/wdt:P17 wd:Q32 .
  }
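
Sending the same query to both endpoints could look roughly like the sketch below. It builds a standard SPARQL protocol GET request with Python's standard library. The helper name and User-Agent string are hypothetical, the endpoint URLs are the two that appear in this thread, and it assumes the local instance predeclares the `wd:` and `wdt:` prefixes as WDQS does (otherwise `PREFIX` lines would be needed).

```python
import urllib.parse
import urllib.request

WIKIDATA = "https://query.wikidata.org/sparql"
LOCAL_BLAZEGRAPH = "http://156.35.82.22/bigdata/sparql"

AUTHORS_QUERY = """
SELECT DISTINCT ?author WHERE {
  ?author wdt:P27 | wdt:P1416/wdt:P17 | wdt:P108/wdt:P17 wd:Q32 .
}
"""


def sparql_request(endpoint: str, query: str) -> urllib.request.Request:
    """Build (but do not send) a SPARQL GET request asking for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url,
        headers={
            "Accept": "application/sparql-results+json",
            "User-Agent": "subset-sketch/0.1 (example)",  # placeholder value
        },
    )


# One request per endpoint; urllib.request.urlopen(request) would execute it.
requests = [sparql_request(e, AUTHORS_QUERY) for e in (WIKIDATA, LOCAL_BLAZEGRAPH)]
```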

We have obtained the following results in milliseconds:

| Endpoint | Country | Nº nodes | Mean | Longest Time | Shortest Time | Iteration 0 | Iteration 1 | Iteration 2 | Iteration 3 | Iteration 4 | Iteration 5 | Iteration 6 | Iteration 7 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Wikidata | Luxembourg | 6004 | 888.5 | 1604 | 231 | 1310 | 254 | 1076 | 231 | 1332 | 1070 | 1604 | 231 |
| Local Blazegraph | Luxembourg | 6004 | 75.125 | 183 | 41 | 183 | 63 | 74 | 68 | 60 | 70 | 42 | 41 |

As the results show, the local Blazegraph endpoint is faster for this query: the mean time drops from 888.5 ms to 75.125 ms, roughly a 12-fold speedup. Because this query touches relatively few nodes, the absolute saving is under a second per run. For a country with many more authors, such as Spain or Germany, the improvement could be more significant.
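
For anyone wanting to repeat the measurement, a minimal timing harness might look like this. The thread does not say how the times were taken, so measuring wall-clock time on the client with `time.perf_counter` is an assumption, and the function names are hypothetical. The summary step is checked against the local Blazegraph iterations reported above.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_runs(run: Callable[[], object], iterations: int = 8) -> List[float]:
    """Call run() repeatedly and return each elapsed time in milliseconds."""
    timings = []
    for _ in range(iterations):
        start = time.perf_counter()
        run()  # e.g. a function that sends the query and reads the response
        timings.append((time.perf_counter() - start) * 1000.0)
    return timings


def summarize(timings: List[float]) -> Dict[str, float]:
    """Reduce per-iteration timings to the columns used in the results table."""
    return {
        "mean": statistics.mean(timings),
        "longest": max(timings),
        "shortest": min(timings),
    }


# Iterations 0-7 for the local Blazegraph endpoint, as reported above.
blazegraph = [183, 63, 74, 68, 60, 70, 42, 41]
print(summarize(blazegraph))
# {'mean': 75.125, 'longest': 183, 'shortest': 41}
```

Eight iterations is a small sample, and the Wikidata timings alternate between roughly 230 ms and over 1000 ms, so reporting the individual iterations alongside the mean, as the table does, is a sensible choice.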

mistermboy commented 3 years ago

At this point, the next step was to run the full country authors query for Luxembourg from WDQS, with the authors subquery federated to our local Blazegraph instance:

WITH {
  SELECT DISTINCT ?author WHERE {
     SERVICE<http://156.35.82.22/bigdata/sparql>{
      ?author wdt:P27 | wdt:P1416/wdt:P17 | wdt:P108/wdt:P17 wd:Q32 .
     }
  }
} AS %authors

However, WDQS permits federation only to a limited list of endpoints. We therefore went the other way round: we ran the full query on the local Blazegraph endpoint and federated the remaining part of the query to Wikidata:


SELECT
  ?number_of_citing_works
  ?author ?authorLabel
  ?organization ?organizationLabel
  ?example_work ?example_workLabel
WITH {
  SELECT DISTINCT ?author WHERE {
    ?author wdt:P27 | wdt:P1416/wdt:P17 | wdt:P108/wdt:P17 wd:Q32 .
  }
} AS %authors
WITH {
    SELECT
      ?author 
      (COUNT(DISTINCT ?citing_work) AS ?number_of_citing_works)
      (SAMPLE(?organization_) AS ?organization)
      (SAMPLE(?work) AS ?example_work)
    WHERE {
      INCLUDE %authors
      SERVICE <https://query.wikidata.org/sparql> {
        ?work wdt:P50 ?author .
        OPTIONAL { ?citing_work wdt:P2860 ?work . }
        OPTIONAL {
          ?author wdt:P1416 | wdt:P108 ?organization_ .
          ?organization_ wdt:P17 wd:Q32 .
        }
      }
    }
    GROUP BY ?author 
} AS %results
WHERE {
  INCLUDE %results
}
ORDER BY DESC(?number_of_citing_works) 

We've compared the performance of the original query run on WDQS with that of this query run on the local Blazegraph instance. The results, in milliseconds, are as follows:

| Endpoint | Country | Nº nodes | Mean | Longest Time | Shortest Time | Iteration 0 | Iteration 1 | Iteration 2 | Iteration 3 | Iteration 4 | Iteration 5 | Iteration 6 | Iteration 7 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Wikidata | Luxembourg | 440 | 3203 | 5477 | 241 | 4294 | 4059 | 5477 | 3076 | 241 | 3420 | 2611 | 2446 |
| Local Blazegraph (federated) | Luxembourg | 440 | 25623 | 32878 | 22933 | 32878 | 26616 | 24266 | 22933 | 24885 | 2454 | 24624 | 24237 |

As the results show, there is no performance improvement in this case. In fact, the federated query is much slower than the original: its mean time is 25623 ms against 3203 ms, about eight times longer.