Component values revisited

jstcki commented 4 years ago

Follow-up to #37 and #38.

It turns out that in practice using componentsValues() is ~2x slower than fetching componentValues for each dimension in parallel 😅

I'm not sure if there's a solution to this on the level of this library (optimizing the generated query), or if it depends how the triple store is set up (possible to index this query?), specify the dimension values explicitly as rdfs:range …

Currently, we're using this functionality for three things:

To correctly infer the data type for filtering (e.g. gYear), which is clearly a workaround as discussed in #32 and I hope we can eventually get rid of this
To construct a default filter set for the initial query (for this we actually just use the first dimension value).
To show all values to the user, so they can adjust filters

As said, for 1. I hope that this won't be needed at all in the future. For 3. I realize that componentsValues or fetching all values up-front is probably overkill and we can should just use componentValues when we need it (ie when the UI is shown).

👉 For 2. (and 1.) I think it would really be useful to be able to specify a limit on componentValues/componentsValues to avoid over-fetching because for these cases we really only need one value. I'm not sure about ordering/sampling because using anything other than LIMIT results in a much slower query (e.g. see http://yasgui.org/short/EyeZfAUrv)

ktk commented 4 years ago

@herrstucki I'm very interested in slow queries, what would help me is to have a list of generated SPARQL queries on our datasets that run slow so I can see the query plan & talk to the triplstore vendor for optimizing it.

jstcki commented 4 years ago

@ktk See e.g. the query I linked in my first message http://yasgui.org/short/EyeZfAUrv … when you remove the LIMIT 1 and/or do any operation like ORDER BY, use MIN/MAX/SAMPLE it becomes super slow (>1.5s)

jstcki commented 4 years ago

@ktk For reference, this is a dataset with LOTS of observations which seems to slow down the componentValues query significantly (even if there are only 3 distinct values in this case).

Example query)%0A++%7D%0A++OPTIONAL+%7B%0A++++%3Fvalue+(rdfs%3Alabel%7Cskos%3AprefLabel)+%3Flabel_en.%0A++++FILTER(LANGMATCHES(LANG(%3Flabel_en)%2C+%22en%22%5E%5Exsd%3Astring))%0A++%7D%0A++OPTIONAL+%7B%0A++++%3Fvalue+(rdfs%3Alabel%7Cskos%3AprefLabel)+%3Flabel_fr.%0A++++FILTER(LANGMATCHES(LANG(%3Flabel_fr)%2C+%22fr%22%5E%5Exsd%3Astring))%0A++%7D%0A++OPTIONAL+%7B%0A++++%3Fvalue+(rdfs%3Alabel%7Cskos%3AprefLabel)+%3Flabel_it.%0A++++FILTER(LANGMATCHES(LANG(%3Flabelit)%2C+%22it%22%5E%5Exsd%3Astring))%0A++%7D%0A++OPTIONAL+%7B%0A++++%3Fvalue+(rdfs%3Alabel%7Cskos%3AprefLabel)+%3Flabel.%0A++++FILTER((LANG(%3Flabel_))+%3D+%22%22%5E%5Exsd%3Astring)%0A++%7D%0A++BIND(COALESCE(%3Flabel_de%2C+%3Flabel_en%2C+%3Flabel_fr%2C+%3Flabelit%2C+%3Flabel%2C+%22%22%5E%5Exsd%3Astring)+AS+%3Flabel)%0A%7D&contentTypeConstruct=text%2Fturtle&contentTypeSelect=application%2Fsparql-results%2Bjson&endpoint=https%3A%2F%2Fint.lindas.admin.ch%2Fquery&requestMethod=POST&tabTitle=Query&headers=%7B%7D&outputFormat=table)

jstcki commented 4 years ago

The above query is ~10x faster if the labels are removed. It's still pretty slow though (~1s).

l00mi commented 4 years ago

Currently this is solved with a DISTINCT query, which does need to query all Observations to be sure that there is not one more other kind of 'value'. So it will take longer (linearely) for bigger datasets. (It can potentially be optimised)%0A++%7D%0A++OPTIONAL+%7B%0A++++%3Fvalue+(rdfs%3Alabel%7Cskos%3AprefLabel)+%3Flabel_en.%0A++++FILTER(LANGMATCHES(LANG(%3Flabel_en)%2C+%22en%22%5E%5Exsd%3Astring))%0A++%7D%0A++OPTIONAL+%7B%0A++++%3Fvalue+(rdfs%3Alabel%7Cskos%3AprefLabel)+%3Flabel_fr.%0A++++FILTER(LANGMATCHES(LANG(%3Flabel_fr)%2C+%22fr%22%5E%5Exsd%3Astring))%0A++%7D%0A++OPTIONAL+%7B%0A++++%3Fvalue+(rdfs%3Alabel%7Cskos%3AprefLabel)+%3Flabel_it.%0A++++FILTER(LANGMATCHES(LANG(%3Flabelit)%2C+%22it%22%5E%5Exsd%3Astring))%0A++%7D%0A++OPTIONAL+%7B%0A++++%3Fvalue+(rdfs%3Alabel%7Cskos%3AprefLabel)+%3Flabel.%0A++++FILTER((LANG(%3Flabel_))+%3D+%22%22%5E%5Exsd%3Astring)%0A++%7D%0A++BIND(COALESCE(%3Flabel_de%2C+%3Flabel_en%2C+%3Flabel_fr%2C+%3Flabelit%2C+%3Flabel%2C+%22%22%5E%5Exsd%3Astring)+AS+%3Flabel)++++%0A++%0A++%7B%0A++++SELECT+DISTINCT+%3Fvalue+%0A++++WHERE+%7B%0A++++++%3Fobservation+rdf%3Atype+qb%3AObservation%3B%0A++++++++%3Chttps%3A%2F%2Fenergy.ld.admin.ch%2Fsfoe%2Fenergyproduction%2Ftech%3E+%3Fvalue%3B%0A++++++++qb%3AdataSet+%3Chttps%3A%2F%2Fenergy.ld.admin.ch%2Fsfoe%2Fenergyproduction%2Fdataset%3E.%0A%0A++++%7D%0A++%7D%0A%7D&contentTypeConstruct=text%2Fturtle&contentTypeSelect=application%2Fsparql-results%2Bjson&endpoint=https%3A%2F%2Fint.lindas.admin.ch%2Fquery&requestMethod=POST&tabTitle=Query&headers=%7B%7D&outputFormat=table) by first getting the distinct URI's and afterwards get the labels for the URI's, as the URI's have a more performant index normally.)

But overall this is should be solved by adding the 'shape' from the new cube description, which simply defines the possible values explicitly for ordinal dimensions. At least as long as we do not have any filters set. The filters need to scan anyway, but potentially a smaller part of the dataset.

zazuko / query-rdf-data-cube

Component values revisited #47