Open jstcki opened 4 years ago
@herrstucki I'm very interested in slow queries, what would help me is to have a list of generated SPARQL queries on our datasets that run slow so I can see the query plan & talk to the triplstore vendor for optimizing it.
@ktk See e.g. the query I linked in my first message http://yasgui.org/short/EyeZfAUrv … when you remove the LIMIT 1
and/or do any operation like ORDER BY
, use MIN/MAX/SAMPLE
it becomes super slow (>1.5s)
@ktk For reference, this is a dataset with LOTS of observations which seems to slow down the componentValues
query significantly (even if there are only 3 distinct values in this case).
Example query)%0A++%7D%0A++OPTIONAL+%7B%0A++++%3Fvalue+(rdfs%3Alabel%7Cskos%3AprefLabel)+%3Flabel_en.%0A++++FILTER(LANGMATCHES(LANG(%3Flabel_en)%2C+%22en%22%5E%5Exsd%3Astring))%0A++%7D%0A++OPTIONAL+%7B%0A++++%3Fvalue+(rdfs%3Alabel%7Cskos%3AprefLabel)+%3Flabel_fr.%0A++++FILTER(LANGMATCHES(LANG(%3Flabel_fr)%2C+%22fr%22%5E%5Exsd%3Astring))%0A++%7D%0A++OPTIONAL+%7B%0A++++%3Fvalue+(rdfs%3Alabel%7Cskos%3AprefLabel)+%3Flabel_it.%0A++++FILTER(LANGMATCHES(LANG(%3Flabelit)%2C+%22it%22%5E%5Exsd%3Astring))%0A++%7D%0A++OPTIONAL+%7B%0A++++%3Fvalue+(rdfs%3Alabel%7Cskos%3AprefLabel)+%3Flabel.%0A++++FILTER((LANG(%3Flabel_))+%3D+%22%22%5E%5Exsd%3Astring)%0A++%7D%0A++BIND(COALESCE(%3Flabel_de%2C+%3Flabel_en%2C+%3Flabel_fr%2C+%3Flabelit%2C+%3Flabel%2C+%22%22%5E%5Exsd%3Astring)+AS+%3Flabel)%0A%7D&contentTypeConstruct=text%2Fturtle&contentTypeSelect=application%2Fsparql-results%2Bjson&endpoint=https%3A%2F%2Fint.lindas.admin.ch%2Fquery&requestMethod=POST&tabTitle=Query&headers=%7B%7D&outputFormat=table)
The above query is ~10x faster if the labels are removed. It's still pretty slow though (~1s).
Currently this is solved with a DISTINCT query, which does need to query all Observations to be sure that there is not one more other kind of 'value'. So it will take longer (linearely) for bigger datasets. (It can potentially be optimised)%0A++%7D%0A++OPTIONAL+%7B%0A++++%3Fvalue+(rdfs%3Alabel%7Cskos%3AprefLabel)+%3Flabel_en.%0A++++FILTER(LANGMATCHES(LANG(%3Flabel_en)%2C+%22en%22%5E%5Exsd%3Astring))%0A++%7D%0A++OPTIONAL+%7B%0A++++%3Fvalue+(rdfs%3Alabel%7Cskos%3AprefLabel)+%3Flabel_fr.%0A++++FILTER(LANGMATCHES(LANG(%3Flabel_fr)%2C+%22fr%22%5E%5Exsd%3Astring))%0A++%7D%0A++OPTIONAL+%7B%0A++++%3Fvalue+(rdfs%3Alabel%7Cskos%3AprefLabel)+%3Flabel_it.%0A++++FILTER(LANGMATCHES(LANG(%3Flabelit)%2C+%22it%22%5E%5Exsd%3Astring))%0A++%7D%0A++OPTIONAL+%7B%0A++++%3Fvalue+(rdfs%3Alabel%7Cskos%3AprefLabel)+%3Flabel.%0A++++FILTER((LANG(%3Flabel_))+%3D+%22%22%5E%5Exsd%3Astring)%0A++%7D%0A++BIND(COALESCE(%3Flabel_de%2C+%3Flabel_en%2C+%3Flabel_fr%2C+%3Flabelit%2C+%3Flabel%2C+%22%22%5E%5Exsd%3Astring)+AS+%3Flabel)++++%0A++%0A++%7B%0A++++SELECT+DISTINCT+%3Fvalue+%0A++++WHERE+%7B%0A++++++%3Fobservation+rdf%3Atype+qb%3AObservation%3B%0A++++++++%3Chttps%3A%2F%2Fenergy.ld.admin.ch%2Fsfoe%2Fenergyproduction%2Ftech%3E+%3Fvalue%3B%0A++++++++qb%3AdataSet+%3Chttps%3A%2F%2Fenergy.ld.admin.ch%2Fsfoe%2Fenergyproduction%2Fdataset%3E.%0A%0A++++%7D%0A++%7D%0A%7D&contentTypeConstruct=text%2Fturtle&contentTypeSelect=application%2Fsparql-results%2Bjson&endpoint=https%3A%2F%2Fint.lindas.admin.ch%2Fquery&requestMethod=POST&tabTitle=Query&headers=%7B%7D&outputFormat=table) by first getting the distinct URI's and afterwards get the labels for the URI's, as the URI's have a more performant index normally.)
But overall this is should be solved by adding the 'shape' from the new cube description, which simply defines the possible values explicitly for ordinal dimensions. At least as long as we do not have any filters set. The filters need to scan anyway, but potentially a smaller part of the dataset.
Follow-up to #37 and #38.
It turns out that in practice using
componentsValues()
is ~2x slower than fetchingcomponentValues
for each dimension in parallel 😅I'm not sure if there's a solution to this on the level of this library (optimizing the generated query), or if it depends how the triple store is set up (possible to index this query?), specify the dimension values explicitly as rdfs:range …
Currently, we're using this functionality for three things:
As said, for 1. I hope that this won't be needed at all in the future. For 3. I realize that
componentsValues
or fetching all values up-front is probably overkill and we can should just usecomponentValues
when we need it (ie when the UI is shown).👉 For 2. (and 1.) I think it would really be useful to be able to specify a
limit
on componentValues/componentsValues to avoid over-fetching because for these cases we really only need one value. I'm not sure about ordering/sampling because using anything other thanLIMIT
results in a much slower query (e.g. see http://yasgui.org/short/EyeZfAUrv)