Explicitly vs. automatically selected dimensions

jstcki commented 4 years ago

Hi!

I just noticed a discrepancy between how explicitly and automatically selected dimensions are handled, and another aspect which makes the automatic selects less-than-useful.

1. Labels are only returned for explicitly selected dimensions (oddly, except for years, which have an empty label).
2. Automatic selects cannot really be used for anything, since they use keys derived from the translated dimension label. These slugified keys cannot be re-associated with a dimension (which is necessary to get the dimension's label etc.). So we end up having to manually select all dimensions anyway.

Point 2 could actually be solved neatly by using the dimension IRI as the key instead of generating keys from the label. If the behavior in point 1 were consistent (i.e. labels present for auto-selects), this would remove the need to explicitly select dimensions at all.

For example:

// Instead of this
[{ forestZone: {...}, canton: {...}}, ...]
// something like this could be returned
[{ "http://environment.ld.admin.ch/foen/px/0703010000_102/dimension/1": {...}, "http://environment.ld.admin.ch/foen/px/0703010000_102/dimension/2": {...}}, ...]

If IRIs are used as keys, the argument to .select() could simply be an array of components or just their IRIs, instead of me having to specify binding names myself (which is also dangerous since these are not slugified!).
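
To illustrate why IRI keys help with point 2, here is a minimal sketch in plain JavaScript. It does not use the library's API: the `dimensions` array, its `{ iri, label }` shape, the labels, the row values and the `describeRow` helper are all assumptions made up for illustration.

```javascript
// Hypothetical dimension metadata; the { iri, label } shape and labels are assumed.
const dimensions = [
  { iri: "http://environment.ld.admin.ch/foen/px/0703010000_102/dimension/1", label: "Forest zone" },
  { iri: "http://environment.ld.admin.ch/foen/px/0703010000_102/dimension/2", label: "Canton" },
];

// Index the dimensions by IRI once so every row key can be resolved directly.
const dimensionsByIri = new Map(dimensions.map((d) => [d.iri, d]));

// A result row keyed by dimension IRI (the values are invented).
const row = {
  "http://environment.ld.admin.ch/foen/px/0703010000_102/dimension/1": { value: "Jura" },
  "http://environment.ld.admin.ch/foen/px/0703010000_102/dimension/2": { value: "Bern" },
};

// Re-associate each cell with its dimension. With slugified label keys
// such as "forestZone" there is no reliable way to do this lookup.
function describeRow(row) {
  return Object.entries(row).map(([iri, cell]) => {
    const dimension = dimensionsByIri.get(iri);
    return { iri, label: dimension ? dimension.label : undefined, value: cell.value };
  });
}

console.log(describeRow(row));
```

The key in each row is the same string as the dimension's identifier, so no reverse mapping from a translated, slugified label is needed.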

vhf commented 4 years ago

Hey, thanks!

Missing labels for automatically selected dimensions: will fix!
To me your suggestion makes sense. It will make a few things uglier, for instance:
- .groupBy("raum") -> .groupBy("https://ld.stadt-zuerich.ch/statistics/property/RAUM")
- .filter(({ someDate }) => someDate.not.equals("2019-08-29T07:27:56.241Z")); not possible anymore (no big deal though)

I'll try something and we'll then discuss the details in a PR.

jstcki commented 4 years ago

Note that querying for labels on all dimensions makes everything much slower, so I wonder if there would be a better way to do this. E.g. by only querying for labels in cube.dimensions() and then stitching them together with a label-less result from cube.query(). Haven't tried though.
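
The stitching idea could look roughly like this. It is an untested sketch in plain JavaScript, not the library's API: it assumes the labels were fetched once per dimension (here a hard-coded `valueLabels` map) and that the label-less query returns rows of the form `{ dimensionIri: valueIri }`. The example.org IRIs and the `attachLabels` helper are hypothetical.

```javascript
// Assumed: labels fetched once, keyed by dimension IRI, then by value IRI.
const valueLabels = new Map([
  ["http://example.org/dimension/canton", new Map([
    ["http://example.org/canton/ZH", "Zürich"],
    ["http://example.org/canton/BE", "Bern"],
  ])],
]);

// Assumed shape of a label-less query result.
const rows = [
  { "http://example.org/dimension/canton": "http://example.org/canton/ZH" },
  { "http://example.org/dimension/canton": "http://example.org/canton/BE" },
  { "http://example.org/dimension/canton": "http://example.org/canton/XX" },
];

// Join each value IRI with its label; values without a known label get "".
function attachLabels(rows, valueLabels) {
  return rows.map((row) => {
    const labelled = {};
    for (const [dimensionIri, valueIri] of Object.entries(row)) {
      const labels = valueLabels.get(dimensionIri);
      const label = labels && labels.has(valueIri) ? labels.get(valueIri) : "";
      labelled[dimensionIri] = { value: valueIri, label };
    }
    return labelled;
  });
}

console.log(attachLabels(rows, valueLabels));
```

The join is a map lookup per cell, so the per-observation label cost moves out of the query and into the client.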

vhf commented 4 years ago

> Note that querying for labels on all dimensions makes everything much slower

Could you please tell us more about this? Would running datacube.components() to fetch all labels be too costly?

jstcki commented 4 years ago

I meant that currently, cube.select(allDimensions).query() is much slower than cube.select([]).query() because selecting dimensions queries for all dimension value labels on each observation.

This is probably related to #47 … adding labels to the query unfortunately makes it much slower.

BTW, we're currently also always setting all potential languages on the entrypoint, e.g. ["de", "fr", "it", "en", ""], because some datasets may only be available in one of these and it's not clear what the fallback should be. Does adding more languages make the query slower? This could probably be optimized if the datasets declared available languages correctly.
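
For what it's worth, one possible fallback is to resolve the preference order on the client. This is only a sketch under assumptions: the `{ value, language }` label shape and the `pickLabel` helper are hypothetical and not part of the library, and the sample labels are invented.

```javascript
// Preference order taken from the list above; "" stands for "no language tag".
const preferredLanguages = ["de", "fr", "it", "en", ""];

// Return the first label whose language matches the preference order,
// or undefined when none of the preferred languages is available.
function pickLabel(labels, languages) {
  for (const language of languages) {
    const match = labels.find((label) => label.language === language);
    if (match) return match.value;
  }
  return undefined;
}

// Hypothetical labels for one dimension value; no German label exists.
const labels = [
  { value: "Forest", language: "en" },
  { value: "Forêt", language: "fr" },
];

console.log(pickLabel(labels, preferredLanguages));
```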

vhf commented 4 years ago

Yeah, adding labels definitely makes things slower, and yes, adding more languages makes it even slower.

I think not fetching labels for automatically selected dimensions and using dimension IRIs as keys would solve most of the issue. Users could fetch dimensions and their labels independently and possibly cache them.
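
A user-side cache could be as small as the sketch below. It assumes a caller-supplied `fetchLabels(dimensionIri, language)` function that returns the labels or a promise of them. That function, `createLabelCache` and the cache key format are hypothetical and not part of the library.

```javascript
// Wrap a label fetcher so each (dimension, language) pair is fetched at most once.
// `fetchLabels` is a hypothetical function supplied by the caller.
function createLabelCache(fetchLabels) {
  const cache = new Map();
  return function getLabels(dimensionIri, language) {
    const key = `${language}|${dimensionIri}`;
    if (!cache.has(key)) {
      // Cache the promise itself so concurrent callers share one request.
      const pending = Promise.resolve(fetchLabels(dimensionIri, language)).catch((error) => {
        // Drop failed requests so a later call can retry.
        cache.delete(key);
        throw error;
      });
      cache.set(key, pending);
    }
    return cache.get(key);
  };
}

// Usage with a stand-in fetcher that just counts how often it is called.
let calls = 0;
const getLabels = createLabelCache(async (dimensionIri, language) => {
  calls += 1;
  return { dimensionIri, language, labels: [] };
});

const first = getLabels("http://example.org/dimension/canton", "de");
const second = getLabels("http://example.org/dimension/canton", "de");
```

Here `first` and `second` are the same promise, so the stand-in fetcher runs only once for that pair.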

> This could probably be optimized if the datasets declared available languages correctly.

@ktk what do you think about this, is it possible to declare the languages somewhere?

zazuko / query-rdf-data-cube

Explicitly vs. automatically selected dimensions #51