Open aurisnoctis opened 6 months ago
Multiple hyphens are allowed in centre-ids. The first hyphen delineates between the TLD and centre name. See https://wmo-im.github.io/wis2-topic-hierarchy/standard/wis2-topic-hierarchy-DRAFT.html#_centre_identification, Permission 2A for more information. Beyond this there is no hierarchy assumed or implied.
If everything after the first hyphen is interpreted as centre-name, that means that no hierarchy between the centre-ids can be derived by software without prior knowledge.
When just inspecting, for example,
de-dwd-gts-to-wis2
de-dwd-global-cache
fr-meteo-france
a machine wouldn't know a priori that the "root center name" in one case is de-dwd (one hyphen) and in the other case fr-meteo-france (two hyphens).
When inspecting the following two centre-ids
ca-eccc-msc
ca-eccc-msc-global-discovery-catalogue
the common "root center name" could be either ca-eccc or ca-eccc-msc.
That means it cannot be established unambiguously which institution is actually releasing the data. If I want more data from the institution that released data with centre-id ca-eccc-msc-global-discovery-catalogue, I would not know whether to look for ca-eccc or ca-eccc-msc.
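To make the ambiguity concrete, here is a minimal Python sketch. The helper names `common_token_prefix` and `candidate_roots` are made up for this illustration and are not part of any WIS2 tooling. It only shows that plain hyphen splitting yields several equally plausible "roots".

```python
def common_token_prefix(a: str, b: str) -> str:
    """Return the longest shared run of leading hyphen-separated tokens."""
    shared = []
    for x, y in zip(a.split("-"), b.split("-")):
        if x != y:
            break
        shared.append(x)
    return "-".join(shared)


def candidate_roots(centre_id: str) -> list[str]:
    """List every prefix made of the TLD plus at least one more token."""
    tokens = centre_id.split("-")
    return ["-".join(tokens[:n]) for n in range(2, len(tokens) + 1)]


# The shared prefix of the two ids is the whole shorter id ...
print(common_token_prefix("ca-eccc-msc", "ca-eccc-msc-global-discovery-catalogue"))
# → ca-eccc-msc

# ... but syntax alone cannot say which prefix is the institution.
print(candidate_roots("ca-eccc-msc"))
# → ['ca-eccc', 'ca-eccc-msc']
```

Without outside knowledge, both entries of the second list are valid guesses for the institution.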
Of course that might not have been the primary goal here, but for us one of the reasons to adhere to the WMO scheme for all our open data metadata (instead of plain UUIDs) was that it would be clear at which institution a dataset originates. In the above scenario, the information about what the institution actually is gets lost: after the country and the first hyphen, it is not defined where the institution part ends and where a routine or other component at that institution begins.
Description of datasets is in the remit of WCMP2 / discovery metadata. WIS2 Global Discovery Catalogue (GDC) search results have the core discovery/description constructs (identification, data policy, access links, spatiotemporal extents). WTH itself is in support of a topic structure for Pub/Sub and event driven architecture. As well, the centre-id is not responsible for articulating the dataset originator (again, in the remit of WCMP2).
@tomkralidis OK, then I misinterpreted the introduction at https://wmo-im.github.io/wis2-topic-hierarchy/standard/wis2-topic-hierarchy-DRAFT.html#_centre_identification
From
It is a single identifier comprised of a top-level domain (TLD) and centre name. It represents the data publisher, distributor or issuing centre of a given dataset, data product, data granule or other resource.
I (wrongly) deduced that it would in fact be clear which of the listed possibilities is actually meant, for example whether it is the "issuing centre of a given dataset" or a "data product". Now I think I understand that another resource will in fact be needed to understand what hierarchy level is actually given in the centre-id.
The phrase "a top-level domain (TLD) and centre name" implied in my view that it will be clear from the centre-id what the TLD is (chars before 1st hyphen ✔️) and what the centre name is. The latter must then be everything that follows the first hyphen. So after the TLD the actual issuing centre (in the sense of "institution") cannot be deduced without the above-mentioned additional resource, because as stated in https://github.com/wmo-im/wis2-topic-hierarchy/issues/136#issuecomment-2090996059, one does not know in the ID where the institution ends and where the product or some other sub entity begins.
We have never considered that extracting the name of the institution or the name of the service (typically dwd or gts-to-wis2) was a requirement. Is there a need for this ?
One can derive this for global services by always checking the last token for an approved global service type (i.e. https://github.com/wmo-im/wcmp2-codelists/blob/main/codelists/global-service-type.csv). But that's only a partial use case.
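A rough sketch of that check in Python. The service types are hard-coded here from the list used elsewhere in this thread; a real implementation would presumably read them from the codelist CSV, whose exact values may differ. `split_global_service` is a hypothetical helper name. Since the service types themselves contain hyphens, the sketch matches on a suffix rather than on a single last token.

```python
from typing import Optional, Tuple

# Assumption: values copied from this thread, not fetched from the codelist.
GLOBAL_SERVICE_TYPES = (
    "global-broker",
    "global-cache",
    "global-discovery-catalogue",
    "global-monitoring",
)


def split_global_service(centre_id: str) -> Tuple[str, Optional[str]]:
    """Return (centre-id without service suffix, service type or None)."""
    for service in GLOBAL_SERVICE_TYPES:
        suffix = f"-{service}"
        if centre_id.endswith(suffix):
            return centre_id[: -len(suffix)], service
    return centre_id, None


print(split_global_service("de-dwd-global-cache"))
# → ('de-dwd', 'global-cache')
print(split_global_service("de-dwd-gts-to-wis2"))
# → ('de-dwd-gts-to-wis2', None)
```

The second call shows the "partial use case": anything that is not a global service comes back unchanged.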
Having said this, the centre-id lookup clearly provides attribution of the publishing centre along with the associated WCMP2 record, which is available in properties.metadata_id in a WNM payload and will become a required element at some point in the future. WCMP also defines contacts at the dataset level that define the publishing centre.
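For illustration, a small Python sketch of pulling the centre-id out of properties.metadata_id. It assumes the identifier follows the urn:wmo:md:{centre_id}:{local_identifier} scheme mentioned in this issue. The payload below is a made-up, heavily trimmed stand-in for a real WNM message, and `centre_id_from_metadata_id` is a hypothetical helper.

```python
import json

URN_PREFIX = "urn:wmo:md:"


def centre_id_from_metadata_id(metadata_id: str) -> str:
    """Extract {centre_id} from urn:wmo:md:{centre_id}:{local_identifier}."""
    if not metadata_id.startswith(URN_PREFIX):
        raise ValueError(f"not a WCMP2 metadata id: {metadata_id!r}")
    centre_id, sep, local_identifier = metadata_id[len(URN_PREFIX):].partition(":")
    if not (centre_id and sep and local_identifier):
        raise ValueError(f"incomplete metadata id: {metadata_id!r}")
    return centre_id


# Hypothetical payload fragment; real WNM messages carry many more fields.
payload = json.loads(
    '{"properties": {"metadata_id": "urn:wmo:md:de-dwd:example-dataset"}}'
)

print(centre_id_from_metadata_id(payload["properties"]["metadata_id"]))
# → de-dwd
```

The centre-id obtained this way is still an opaque key; resolving it to an institution requires the centre-id list or the WCMP2 record.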
Nice! fr-meteo-france migrated to fr-meteofrance in centre-id.csv. That means parsing with the scheme <country>-<institution>-<more details such as global service type> just became more feasible.
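A minimal sketch of such a parse in Python, assuming (and this is only an assumption, not something the specification guarantees) that the first two tokens are always country and institution. `split_three` is a hypothetical name.

```python
from typing import Optional, Tuple


def split_three(centre_id: str) -> Tuple[str, str, Optional[str]]:
    """Split into (country, institution, details) on the first two hyphens."""
    parts = centre_id.split("-", 2)
    if len(parts) < 2:
        raise ValueError(f"expected at least <country>-<institution>: {centre_id!r}")
    details = parts[2] if len(parts) == 3 else None
    return parts[0], parts[1], details


print(split_three("fr-meteofrance"))
# → ('fr', 'meteofrance', None)
print(split_three("de-dwd-gts-to-wis2"))
# → ('de', 'dwd', 'gts-to-wis2')
print(split_three("ca-eccc-msc"))
# → ('ca', 'eccc', 'msc')
```

The last call shows the limit of this assumption: msc ends up in the "details" slot whether or not it is meant as part of the centre name.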
To understand a centre-id as a consumer of a metadata service, I can now probably parse it without opening the data itself (cases such as ca-eccc-msc remain, but probably just ca and eccc in the parsing, without msc,
can be used to understand what is meant). However, I still think a more robust scheme would be one where the separator of functional units within the centre-id can't be used within proper names. That would probably make interpreting the different parts of the centre-id much easier:
(A) Hyphens - are always strictly separators in the scheme <country>-<institution>-<more details such as global service type> and can't be part of the name of an institution, e.g., de-dwd-global_cache or de-dwd-globalcache or de-dwd-gts_to_wis2 or data-metoffice-noaa_global_cache. The hyphens cut the centre-id cleanly into its logical parts instead of into a mix of logical parts and name parts.
de-dwd-global-cache --> cut by separator - into de, dwd, global, cache, a mix of functional units and parts of names
de-dwd-global_cache --> cut by separator - into de, dwd, global_cache
(B) Alternative: hyphens are so ubiquitous in names that it is not easy to use them as a reserved separator, so another char could be used as separator, as was done in WIS before, e.g., de.dwd.global-cache or de.dwd.gts-to-wis2 or data.metoffice.noaa-global-cache. In this case the . dot cuts the centre-id into logical parts.
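The difference between the current ids and proposals (A) and (B) can be sketched in a few lines of Python. The ids in forms (A) and (B) are the hypothetical spellings proposed above, not registered centre-ids, and `split_units` is just an illustrative helper.

```python
def split_units(centre_id: str, separator: str) -> list[str]:
    """Cut a centre-id into units on a reserved separator character."""
    return centre_id.split(separator)


# Current form: the hyphen is both a separator and part of names.
print(split_units("de-dwd-global-cache", "-"))
# → ['de', 'dwd', 'global', 'cache']

# Proposal (A): hyphen reserved as separator, underscore inside names.
print(split_units("de-dwd-global_cache", "-"))
# → ['de', 'dwd', 'global_cache']

# Proposal (B): dot reserved as separator, hyphen allowed inside names.
print(split_units("de.dwd.gts-to-wis2", "."))
# → ['de', 'dwd', 'gts-to-wis2']
```

With a reserved separator, each list element is one logical unit and no lookup table is needed to regroup the tokens.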
@golfvert Thank you for the question
We have never considered that extracting the name of the institution or the name of the service (typically dwd or gts-to-wis2) was a requirement. Is there a need for this ?
As the open data team at DWD, we assumed that adopting the WIS2.0 scheme for all our metadata IDs would bring the advantage of being able to identify from which country and which institution a dataset comes without opening it. Mixing the function of hyphens as separators with them being part of names as well makes the automatic/machine interpretation of a centre-id ambiguous.
It may very well be that this type of machine readability was never in the scope of the centre-id, but it would be a missed opportunity in my opinion.
Best regards!
I don't think we are going to change the hyphen to something else to make the country more visible. Sorry. The centre-id is designed to be a unique name, structured around the country, the name of the institution and some additional fields when needed. It is not designed to be parsed in the reverse direction. What is doable is to add additional columns with the country and the name of the institution to the centre-id.csv file. Getting and caching this updated file from GitHub would give the same result. Acceptable ?
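As a sketch of how a consumer might use such an extended file in Python: the column names Name, Country and Institution and the sample rows below are purely hypothetical, since the extra columns do not exist yet, and the inline string stands in for a cached copy of centre-id.csv.

```python
import csv
import io

# Hypothetical layout and illustrative rows, not the real centre-id.csv.
SAMPLE_CSV = """\
Name,Country,Institution
de-dwd,Germany,Deutscher Wetterdienst
de-dwd-global-cache,Germany,Deutscher Wetterdienst
fr-meteofrance,France,Meteo-France
"""


def load_centres(csv_text: str) -> dict[str, dict[str, str]]:
    """Index the rows of a centre-id CSV by their centre-id."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["Name"]: row for row in reader}


centres = load_centres(SAMPLE_CSV)
entry = centres["de-dwd-global-cache"]
print(entry["Country"], "/", entry["Institution"])
# → Germany / Deutscher Wetterdienst
```

The lookup replaces any parsing of the id itself, at the cost of needing a reasonably fresh copy of the file.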
We discussed dotted paths previously and decided not to use them due to edge cases with other message queuing protocols (for example, running an MQTT/AMQP bridge/facade). We also decided against _ given that centre-ids would be part of URLs/endpoints. Note that eccc-msc delineates a specific branch of our organizational structure that we (ECCC/MSC) choose to mint as a "centre" in WIS2. The second token of centre-id is the "centre" per se. We also decided that WIS2 Global Services would be suffixed with their respective function.
The centre-id.csv has the institution name, but not the country in a human readable form. Using TLDs helps in providing the country name. Having said this, the centre-id implementation does have some heuristics in order to identify global services, which would be implemented with something like:
global_services = [
    'global-broker',
    'global-cache',
    'global-discovery-catalogue',
    'global-monitoring'
]

centre_id = 'ca-eccc-msc-global-discovery-catalogue'

# split centre-id on the first dash
tld, centre = centre_id.split('-', 1)

# strip any global service function identification
for gs in global_services:
    centre = centre.replace(f'-{gs}', '')

print(tld, centre)  # prints: ca eccc-msc
Dear colleagues, when looking at the centre-id.csv I noticed that at first it seems to separate hierarchy levels by hyphens, e.g. for DWD:
de-dwd: <country>-<institution>
But that doesn't seem to be the case in
de-dwd-gts-to-wis2
where the last 3 items seem to be one name, but suggest further hierarchy levels via the hyphens, or in
fr-meteo-france: after the country, only "meteo" would be the institution when machine-parsing with a hyphen as separator.
As far as I understood, the scheme
urn:wmo:md:{centre_id}:{local_identifier}
offers the opportunity to parse the origin of a dataset without opening it. In the examples above, hyphens as hierarchy level separators are mixed with hyphens as part of names. That will make automatic parsing of the data source ambiguous.
Best regards, Hella Riede (DWD)