wmo-im / wis2-topic-hierarchy

WIS2 Topic Hierarchy
https://wmo-im.github.io/wis2-topic-hierarchy
Apache License 2.0

consistent hierarchy levels in centre-id? #136

Open aurisnoctis opened 6 months ago

aurisnoctis commented 6 months ago

Dear colleagues, when looking at the centre-id.csv I noticed that at first it seems to separate hierarchy levels by hyphens, e.g. for DWD:

de-dwd: <country>-<institution>

But that does not seem to hold in other cases, such as

de-dwd-gts-to-wis2, where the last three items seem to form one name, yet the hyphens suggest further hierarchy levels,

or fr-meteo-france, where machine-parsing with a hyphen as separator would yield only "meteo" as the institution after the country.

As far as I understood, the scheme urn:wmo:md:{centre_id}:{local_identifier} offers the opportunity to parse the origin of a dataset without opening it. In the examples above, hyphens as hierarchy level separators are mixed with hyphens as part of names. That will make automatic parsing of the data source ambiguous.
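A minimal sketch of the parsing I have in mind, assuming identifiers follow the urn:wmo:md:{centre_id}:{local_identifier} scheme exactly. The function name and the example local identifier are made up for illustration. Isolating the centre-id is unambiguous; splitting it further on hyphens is where the problem starts:

```python
def parse_metadata_id(metadata_id: str) -> tuple[str, str]:
    """Split a urn:wmo:md:{centre_id}:{local_identifier} string into its two variable parts."""
    prefix = "urn:wmo:md:"
    if not metadata_id.startswith(prefix):
        raise ValueError(f"not a urn:wmo:md identifier: {metadata_id!r}")
    # Assumption: the centre-id contains no colon, so the first colon after
    # the prefix ends it; the local identifier may contain further colons.
    centre_id, sep, local_identifier = metadata_id[len(prefix):].partition(":")
    if not centre_id or not sep or not local_identifier:
        raise ValueError(f"incomplete identifier: {metadata_id!r}")
    return centre_id, local_identifier


# Hypothetical identifier, for illustration only.
centre_id, local_id = parse_metadata_id("urn:wmo:md:de-dwd-gts-to-wis2:example-dataset")
print(centre_id)             # de-dwd-gts-to-wis2
print(centre_id.split("-"))  # ['de', 'dwd', 'gts', 'to', 'wis2']
```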

Best regards, Hella Riede (DWD)

tomkralidis commented 6 months ago

Multiple hyphens are allowed in centre-ids. The first hyphen delineates between the TLD and centre name. See https://wmo-im.github.io/wis2-topic-hierarchy/standard/wis2-topic-hierarchy-DRAFT.html#_centre_identification, Permission 2A for more information. Beyond this there is no hierarchy assumed or implied.

aurisnoctis commented 6 months ago

If everything after the first hyphen is interpreted as centre-name, that means that no hierarchy between the centre-ids can be derived by software without prior knowledge.

When just inspecting for example

de-dwd-gts-to-wis2
de-dwd-global-cache
fr-meteo-france

a machine wouldn't know a priori that the "root center name" in one case is de-dwd (one hyphen) and in the other case fr-meteo-france (two hyphens).

When inspecting the following two centre-ids

ca-eccc-msc
ca-eccc-msc-global-discovery-catalogue

the common "root center name" could be either ca-eccc or ca-eccc-msc.

That means it cannot be established unambiguously what the institution releasing the data actually is. If I want more data from the institution that released with centre-id ca-eccc-msc-global-discovery-catalogue, I would not know whether to look for ca-eccc or ca-eccc-msc.
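To illustrate the ambiguity, here is a small sketch (the helper name candidate_roots is hypothetical) that lists every hyphen-delimited prefix two centre-ids have in common. Without outside knowledge, each shared prefix is an equally plausible institution:

```python
def candidate_roots(centre_id: str) -> list[str]:
    """Return every hyphen-delimited prefix made of the TLD plus at least one more token."""
    tokens = centre_id.split("-")
    return ["-".join(tokens[:n]) for n in range(2, len(tokens) + 1)]


a = "ca-eccc-msc"
b = "ca-eccc-msc-global-discovery-catalogue"

# Prefixes shared by both identifiers, in order of increasing length.
shared = [root for root in candidate_roots(a) if root in candidate_roots(b)]
print(shared)  # ['ca-eccc', 'ca-eccc-msc']
```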

Of course that might not have been the primary goal here, but for us one of the reasons to adhere to the WMO scheme with all our open data metadata (instead of plain UUIDs) was that it would be clear at which institution a dataset originates. In the above scenario, the information about which institution is meant is lost, because after the country and the first hyphen it is not defined where the institution part ends and where a routine or other component of that institution begins.

tomkralidis commented 6 months ago

Description of datasets is in the remit of WCMP2 / discovery metadata. WIS2 Global Discovery Catalogue (GDC) search results have the core discovery/description constructs (identification, data policy, access links, spatiotemporal extents). WTH itself is in support of a topic structure for Pub/Sub and event-driven architecture. As well, the centre-id is not responsible for articulating the dataset originator (again, in the remit of WCMP2).

aurisnoctis commented 5 months ago

@tomkralidis OK, then I misinterpreted the introduction at https://wmo-im.github.io/wis2-topic-hierarchy/standard/wis2-topic-hierarchy-DRAFT.html#_centre_identification

From

"It is a single identifier comprised of a top-level domain (TLD) and centre name. It represents the data publisher, distributor or issuing centre of a given dataset, data product, data granule or other resource."

I (wrongly) deduced that it would in fact be clear which of the listed possibilities applies, for example whether it is the "issuing centre of a given dataset" or a "data product". Now I think I understand that another resource is needed to understand which hierarchy level a centre-id actually expresses.

"a top-level domain (TLD) and centre name"

implied in my view that it will be clear from the centre-id what the TLD is (chars before 1st hyphen ✔️) and what the centre name is. The latter must then be everything that follows the first hyphen. So after the TLD, the actual issuing centre (in the sense of "institution") cannot be deduced without the above-mentioned additional resource, because, as stated in https://github.com/wmo-im/wis2-topic-hierarchy/issues/136#issuecomment-2090996059, one does not know where in the ID the institution ends and where the product or some other sub-entity begins.

golfvert commented 3 months ago

We have never considered that extracting the name of the institution or the name of the service (typically dwd or gts-to-wis2) was a requirement. Is there a need for this?

tomkralidis commented 3 months ago

One can derive this for global services by always checking the last token for an approved global service type (i.e. https://github.com/wmo-im/wcmp2-codelists/blob/main/codelists/global-service-type.csv). But that's only a partial use case.

Having said this, the centre-id lookup clearly provides attribution of the publishing centre along with the associated WCMP2 record, which is available in properties.metadata_id in a WNM payload and will become a required element at some point in the future. WCMP also defines contacts at the dataset level that define the publishing centre.

aurisnoctis commented 3 months ago

Nice! fr-meteo-france was migrated to fr-meteofrance in centre-id.csv. That means parsing with the scheme <country>-<institution>-<more details such as global service type> just became more feasible.

Implicit scheme now

To understand a centre-id as a consumer of a metadata service, I can now probably do the following without opening the data itself:

  1. read the first two hyphen-separated words to learn which country and which institution it comes from (Canada still uses a hyphen in the institution name, ca-eccc-msc, but parsing just ca and eccc without msc is probably enough to understand what is meant).
  2. match the last two to three hyphen-separated words against the global service types

Dedicated separator in the centre-id scheme?

However, I still think a more robust scheme would be one where the separator between functional units of the centre-id cannot be used within proper names. That would probably make interpreting the different parts of the centre-id much easier:

(A) Hyphens - are always strictly separators in the scheme <country>-<institution>-<more details such as global service type> and can't be part of the name of an institution, e.g., de-dwd-global_cache or de-dwd-globalcache or de-dwd-gts_to_wis2 or data-metoffice-noaa_global_cache. The hyphens cut the centre-id cleanly into its logical parts instead of into a mix of logical parts and name parts.

(B) Alternative: hyphens are so ubiquitous in names that they are not easy to reserve as a separator, so another character could be used as the separator, as was done in WIS before, e.g., de.dwd.global-cache or de.dwd.gts-to-wis2 or data.metoffice.noaa-global-cache. In this case the dot (.) cuts the centre-id into logical parts.
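A rough sketch of how either option would parse, assuming at most three logical parts (country, institution, optional details). The names CentreIdParts and parse_centre_id are hypothetical, and neither scheme is part of the current specification:

```python
from typing import NamedTuple, Optional


class CentreIdParts(NamedTuple):
    country: str
    institution: str
    details: Optional[str]


def parse_centre_id(centre_id: str, separator: str) -> CentreIdParts:
    """Split a centre-id whose separator never occurs inside a name."""
    parts = centre_id.split(separator)
    if not 2 <= len(parts) <= 3:
        raise ValueError(f"expected 2 or 3 parts, got {len(parts)}: {centre_id!r}")
    details = parts[2] if len(parts) == 3 else None
    return CentreIdParts(parts[0], parts[1], details)


# Option (A): hyphen reserved as separator, underscores inside names.
print(parse_centre_id("de-dwd-gts_to_wis2", "-"))
# Option (B): dot as separator, hyphens free to appear inside names.
print(parse_centre_id("data.metoffice.noaa-global-cache", "."))
```

With today's identifiers the same function rejects de-dwd-gts-to-wis2, because splitting on the hyphen yields five parts instead of three.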

@golfvert Thank you for the question

"We have never considered that extracting the name of the institution or the name of the service (typically dwd or gts-to-wis2) was a requirement. Is there a need for this?"

As the open data team at DWD, we assumed that adopting the WIS2.0 scheme for all our metadata IDs would bring the advantage of being able to identify the country and institution a dataset comes from without opening it. Using hyphens both as separators and as part of names makes the automatic/machine interpretation of a centre-id ambiguous.

It may very well be that this type of machine readability was never in the scope of the centre-id but it would be a missed opportunity in my opinion.

Best regards!

golfvert commented 3 months ago

I don't think we will change the hyphen to something else to make the country more visible. Sorry. The centre-id is designed to be a unique name, structured around the country, the name of the institution and some additional fields when needed. It is not designed to be used in the reverse direction. What is doable is to add additional columns to the centre-id.csv file with the country and the name of the institution. Getting and caching this updated file from GitHub would give the same result. Acceptable?
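As a sketch of what that lookup could look like on the consumer side: the column names centre_id, country and institution below are assumptions, since these columns do not exist yet, and the sample rows are illustrative rather than copied from the real file.

```python
import csv
import io

# Hypothetical excerpt of an extended centre-id.csv. The header names and the
# 'country' / 'institution' columns are assumptions for illustration only.
SAMPLE_CSV = """centre_id,country,institution
de-dwd,de,dwd
de-dwd-gts-to-wis2,de,dwd
fr-meteofrance,fr,meteofrance
"""


def load_centre_lookup(text: str) -> dict[str, tuple[str, str]]:
    """Map each centre-id to its (country, institution) pair."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["centre_id"]: (row["country"], row["institution"]) for row in reader}


lookup = load_centre_lookup(SAMPLE_CSV)
print(lookup["de-dwd-gts-to-wis2"])  # ('de', 'dwd')
```

In practice the text would come from a cached copy of the file rather than an inline string.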

tomkralidis commented 3 months ago

We discussed dotted paths previously and decided not to use them due to edge cases involving other message queuing protocols (for example, when running an MQTT/AMQP bridge/facade). We also decided against _ given that centre-ids would be part of URLs/endpoints. Note that eccc-msc delineates a specific branch of our organizational structure that we (ECCC/MSC) choose to mint as a "centre" in WIS2. The second token of centre-id is the "centre" per se. We also decided that WIS2 Global Services would be suffixed with their respective function.

The centre-id.csv has the institution name, but not the country in a human readable form. Using TLDs helps in providing the country name. Having said this, the centre-id implementation does have some heuristics in order to identify global services, which would be implemented with something like:

global_services = [
    'global-broker',
    'global-cache',
    'global-discovery-catalogue',
    'global-monitoring'
]

centre_id = 'ca-eccc-msc-global-discovery-catalogue'

# split centre-id on the first dash: TLD before it, centre name after it
tld, centre = centre_id.split('-', 1)

# strip a trailing global service function, if present
for gs in global_services:
    if centre.endswith(f'-{gs}'):
        centre = centre[:-len(gs) - 1]
        break

print(tld, centre)  # ca eccc-msc