wot-oss / proposal

Eclipse Public License 2.0

proposal: thing-model-catalog: identification of ThingModels #10

Open alexbrdn opened 8 months ago

alexbrdn commented 8 months ago

[Description]

Identification of ThingModels

Thing Models (TMs) need to be unambiguously identified so that they can be referenced within the Thing Model Catalog (TMC) (e.g. in search results or when fetching from the TMC) or outside the TMC (e.g. from TDs or other TMs in the field).

This proposal intends to describe how TMC is going to handle identification of TMs.

Note that the W3C TD standard does not provide guidance on this so far (see https://github.com/w3c/wot-thing-description/issues/1905). Hence, whatever solution we implement may become incompatible with the standard later.

Context

Each TM within the catalog will include the manufacturer's name, the manufacturer part number, the author's name, and a version (together, the identifying fields). All of these except the version will have to be provided at import, and their presence will be enforced by the importing API. The version may be included in the TM, or it may need to be generated automatically when importing a TM.

Requirements for identification system

Use cases to be considered

[How]

Proposed Identification

Each TM within the catalog is uniquely identified by the combination of identifying fields. The id field of a TM is composed of these fields as follows:

id: [author_name "/"] manufacturer_name "/" mpn "/" ([ optional_path_part "/" ])* version ".tm.json"

author_name, manufacturer_name, and mpn must be present in the TM at the following paths, respectively: $/schema:author/name, $/schema:manufacturer/name, $/schema:mpn. These fields are defined by https://schema.org/author, https://schema.org/manufacturer, and https://schema.org/mpn. All three fields are sanitized for use as parts of a URI path by replacing consecutive whitespace and special characters not allowed in base file names with "-".

When author_name and manufacturer_name are the same, the author_name is omitted from the id.

Optional path parts may be added by the author when importing to TMC.

This id schema can be closely followed by the storage schema. It lets a contributor define her own hierarchy for TMs and any additional files that may be provided along with the TMs.
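The composition rule above can be sketched in Python as follows. This is only an illustration, not the TMC implementation: the function names, the exact set of characters treated as "not allowed", and the choice to compare the sanitized names when deciding whether to omit author_name are all assumptions.

```python
import re


def sanitize(name: str) -> str:
    # Assumption: everything except ASCII letters, digits, ".", "_" and "-"
    # counts as "not allowed"; each run of such characters becomes one "-".
    return re.sub(r"[^A-Za-z0-9._-]+", "-", name.strip())


def build_id(author, manufacturer, mpn, version, optional_parts=()):
    author_s = sanitize(author)
    manufacturer_s = sanitize(manufacturer)
    parts = []
    # author_name is omitted when it equals manufacturer_name
    # (assumption: compared after sanitizing).
    if author_s != manufacturer_s:
        parts.append(author_s)
    parts += [manufacturer_s, sanitize(mpn), *optional_parts, version + ".tm.json"]
    return "/".join(parts)


print(build_id("omnicorp", "senseall", "temperature",
               "v1.0.0-20231115134253-abcdef012345"))
```

Note that an id built for an "official" TM with one optional path part can come out identical to an id built for a third-party TM without optional parts.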

Handling of the version field

The version field of the id closely follows the format of Golang modules' pseudo-version numbers. It has the following format:

version: base_version "-" timestamp "-" content_hash
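A minimal sketch of generating such a version string is shown below. The proposal does not fix the timestamp layout or the hash function. This sketch assumes a UTC timestamp in yyyymmddhhmmss form and a 12-character hex prefix, as in Go pseudo-versions, and assumes SHA-1 over the raw file bytes as the content hash. make_version is a hypothetical helper name.

```python
import hashlib
from datetime import datetime, timezone


def make_version(base_version: str, content: bytes, when: datetime) -> str:
    # Assumption: UTC timestamp formatted as yyyymmddhhmmss.
    timestamp = when.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    # Assumption: first 12 hex characters of the SHA-1 digest of the TM bytes.
    content_hash = hashlib.sha1(content).hexdigest()[:12]
    return f"{base_version}-{timestamp}-{content_hash}"


print(make_version("v1.0.0", b'{"title": "Lamp"}',
                   datetime(2023, 11, 15, 13, 42, 53, tzinfo=timezone.utc)))
```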

Details on implementing specific use cases

Importing a TM

Importing presents two distinct sub-cases with regard to identifying the TM after it's been imported:

  1. The TM being imported has no id

Generate an id as described above.

  2. The TM already has an id.

If the id does not conform to the TM id schema described above, move the original id to a link relation of type "original". Generate an id as per TM id schema. Do the same if the id does conform to this schema, but identifying fields in the id are not equal to those in the TM.

If the id does conform to the TM id schema, generate the new version value.
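The two sub-cases might be handled roughly as in the sketch below. It is an assumption-laden illustration: apply_import_id is a hypothetical helper, conformance is approximated by checking that the existing id starts with the prefix expected from the TM's identifying fields and ends with ".tm.json", and the original id is preserved as an entry in the TM's links array.

```python
import copy


def apply_import_id(tm: dict, expected_prefix: str, new_id: str) -> dict:
    # Work on a copy so the caller's TM is left untouched.
    result = copy.deepcopy(tm)
    old_id = tm.get("id")
    if old_id is not None:
        # Assumption: "conforms and matches the identifying fields" is
        # approximated by a prefix and suffix check.
        conforms = old_id.startswith(expected_prefix) and old_id.endswith(".tm.json")
        if not conforms:
            # Keep the original id as a link relation of type "original".
            result.setdefault("links", []).append({"rel": "original", "href": old_id})
    # In every case the TM receives the newly generated id and version.
    result["id"] = new_id
    return result


imported = apply_import_id(
    {"id": "urn:example:lamp"},
    "omnicorp/lamp/",
    "omnicorp/lamp/v1.0.0-20231115134253-abcdef012345.tm.json",
)
print(imported)
```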

After the id of the TM being imported has been determined, compare it with existing TMs. For this purpose, versions with the same base version and the same content hash are considered equal, irrespective of the timestamp value. The TMC CLI/API may abort importing if the TMC already contains the same TM with equivalent contents apart from the version timestamp.
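The equality rule for versions could look like the following sketch. The helper names are hypothetical. The version is split from the right so that a base version containing a "-" (for example a semantic version with a pre-release tag) is still handled.

```python
def split_version(version: str):
    # base_version "-" timestamp "-" content_hash, split from the right.
    base, timestamp, content_hash = version.rsplit("-", 2)
    return base, timestamp, content_hash


def same_content(a: str, b: str) -> bool:
    base_a, _, hash_a = split_version(a)
    base_b, _, hash_b = split_version(b)
    # The timestamp is deliberately ignored.
    return (base_a, hash_a) == (base_b, hash_b)


print(same_content("v1.0.0-20231115134253-abcdef012345",
                   "v1.0.0-20240101000000-abcdef012345"))
```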

Identifying a TM in search results

Search results will print out the name of the remote where the result has been found and the id, including the version.

Fetching a TM

The id with version as returned by the search should be enough to fetch a TM. A TM can also be fetched by an incomplete id, where the version field is skipped (referring to the latest version) or contains only the base semantic version (referring to the latest of the versions with the same base version part).
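Resolving a complete or incomplete id against a list of known ids might be sketched as below. The names are hypothetical, and several points are assumptions: base versions are plain "vMAJOR.MINOR.PATCH" without pre-release tags, "latest" means highest base version first and newest timestamp second, and the name part of a reference must match everything before the file name exactly.

```python
def split_id(tm_id: str):
    name, _, file_name = tm_id.rpartition("/")
    # str.removesuffix requires Python 3.9 or newer.
    version = file_name.removesuffix(".tm.json")
    base, timestamp, _content_hash = version.rsplit("-", 2)
    return name, base, timestamp


def semver_key(base: str):
    # Assumption: "v1.2.3" style base versions only.
    return tuple(int(p) for p in base.lstrip("v").split("."))


def resolve(ref: str, available):
    if ref in available:  # complete id including version
        return ref
    candidates = []
    for tm_id in available:
        name, base, timestamp = split_id(tm_id)
        # Version skipped entirely, or only the base version given.
        if ref == name or ref == f"{name}/{base}":
            candidates.append((semver_key(base), timestamp, tm_id))
    return max(candidates)[2] if candidates else None


ids = [
    "omnicorp/lamp/v1.0.0-20231115134253-abcdef012345.tm.json",
    "omnicorp/lamp/v1.0.0-20231201000000-111111111111.tm.json",
    "omnicorp/lamp/v1.1.0-20231120000000-222222222222.tm.json",
]
print(resolve("omnicorp/lamp", ids))
print(resolve("omnicorp/lamp/v1.0.0", ids))
```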

Referencing a TM

It is up to the Consumer to refer to a TM by its id as provided in the TM file (i.e. relative IRI), or by resolved absolute IRI.

Privacy Considerations

With regard to the Privacy Considerations outlined in the WoT TD standard (https://www.w3.org/TR/wot-thing-description11/#sec-privacy-consideration), a Consumer working in a privacy-sensitive context SHOULD NOT include the link to the TM in generated TDs. This avoids leaking potentially private information contained in a TM's id.

[Documentation]

<--- OPTIONAL: if you feel it is needed, provide a related documentation as described in the README.md of the repository --->

cc => @hadjian @andrisciu

egekorkan commented 8 months ago

Open_question: how can the version be generated, if it is to follow the standard's recommendation of SEMVER format? Generating random versions (e.g. uuid) prevents ordering of versions and automatically determining which version is the latest.

There is planned work on this in the standardization. If you have any input on how this should happen, we will happily take it. My initial idea is to use the same concepts as API versioning. E.g. removing an affordance or changing its data schema is a breaking change; adding an affordance, description, or title is a new feature; and fixing a typo in a description/title is a patch. If this is standardized, it will very probably be part of the discovery spec, which will also cover TMs, which means that TM directories may check for version changes (it is a bit of a stretch, but a realistic one). E.g. if there is a breaking change but the major version is not incremented, the TM is rejected.
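As a rough illustration of the API-versioning idea, the sketch below classifies the difference between two TMs. It is deliberately coarse and not part of any standard: classify_change is a hypothetical name, only properties, actions, and events are inspected as affordances, any non-text difference inside an existing affordance is treated as breaking, and all remaining differences (including an added title or description) are reported as a patch for simplicity.

```python
AFFORDANCE_KEYS = ("properties", "actions", "events")
TEXT_KEYS = ("title", "description")


def strip_text(value):
    # Drop human-readable text so only the structural schema is compared.
    if isinstance(value, dict):
        return {k: strip_text(v) for k, v in value.items() if k not in TEXT_KEYS}
    if isinstance(value, list):
        return [strip_text(v) for v in value]
    return value


def classify_change(old: dict, new: dict) -> str:
    added = False
    for key in AFFORDANCE_KEYS:
        old_aff, new_aff = old.get(key, {}), new.get(key, {})
        for name, definition in old_aff.items():
            if name not in new_aff:
                return "major"  # affordance removed
            if strip_text(definition) != strip_text(new_aff[name]):
                return "major"  # data schema changed
        if set(new_aff) - set(old_aff):
            added = True  # new affordance
    if added:
        return "minor"
    return "patch" if old != new else "none"


old = {"title": "Lamp", "properties": {"on": {"type": "boolean", "description": "Stae"}}}
fixed = {"title": "Lamp", "properties": {"on": {"type": "boolean", "description": "State"}}}
print(classify_change(old, fixed))
```

A TM directory could compare the result with the actual version increment and reject an import where, for example, the result is "major" but the major version is unchanged.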

alexbrdn commented 8 months ago

EDIT: restructured text, outlined use cases relevant for TM identification considerations

alexbrdn commented 8 months ago

EDIT: added proposal on how to handle versions of TMs in IDs

a-hennig commented 7 months ago

I think we cannot assume that the TM comes from the manufacturer him/herself. That's probably why the author_name is proposed as the first element in the path. To avoid accidental confusion between the "authoring responsible entity" and the "person writing it", we could rename it to origin_name or similar.

a-hennig commented 7 months ago

If I see an instantiated TD, I want to be able to check its authenticity ... so I don't think leaving out the reference to the TM is a good thing (and I didn't get how it helps with privacy). Leaving the source in might also be needed for copyright / author's acknowledgement (of the entity authoring it, not the person unless so chosen).

We also need a way to verify integrity, i.e. that it hasn't been manipulated.

daHaimi commented 7 months ago

I would opt for identification and URL-resolution by some standard URI like purl (e.g. https://github.com/package-url/purl-spec)

This would allow the URI to identify the thing as a TM and to define where it can be found, or be parsed into a URL or local path, e.g.:

This also requires semantic versioning and allows for addressing "standards" and "alternatives"

hadjian commented 7 months ago

I think we cannot assume that the TM comes from the manufacturer him/herself. That's probably why the author_name is proposed as the first element in the path. To avoid accidental confusion between the "authoring responsible entity" and the "person writing it", we could rename it to origin_name or similar.

Yep, we noticed the confusion when talking about it. Should name it "authority" or something.

hadjian commented 7 months ago

If I see an instantiated TD, I want to be able to check its authenticity ... so I don't think leaving out the reference to the TM is a good thing (and I didn't get how it helps with privacy). Leaving the source in might also be needed for copyright / author's acknowledgement (of the entity authoring it, not the person unless so chosen).

We also need a way to verify integrity, i.e. that it hasn't been manipulated.

Agree. Also it addresses consumers in "privacy sensitive" environments. Doesn't have an impact on this proposal. @alexbrdn why did you include the paragraph?

hadjian commented 7 months ago

I think we cannot assume that the TM comes from the manufacturer him/herself. That's probably why the author_name is proposed as the first element in the path. To avoid accidental confusion between the "authoring responsible entity" and the "person writing it", we could rename it to origin_name or similar.

Agree, but we are using terms from schema.org and there is no authority or origin_name there. Actually, author can have an organization or person value, so per schema.org it is correct. Maybe some other terms from CreativeWork are more clear. https://schema.org/CreativeWork:

  • schema:publisher
  • schema:producer
  • schema:provider
  • schema:creator (same as author)

@a-hennig @alexbrdn what do you think?

hadjian commented 7 months ago

@alexbrdn if the original id is actually a templated id, like currently suggested in the standard, do we also move it to a link relation?

a-hennig commented 7 months ago

https://github.com/web-of-things-open-source/proposal/issues/10#issuecomment-1802723072 As a "dcterms guy", I would use https://www.dublincore.org/specifications/dublin-core/dcmi-terms/elements11/publisher/

a-hennig commented 7 months ago

https://github.com/web-of-things-open-source/proposal/issues/10#issuecomment-1802732757 Is this about TM instantiation? I thought this is out of scope for the catalog?

https://w3c.github.io/wot-thing-description/#example-65 suggests that the TM leading to a TD instance should/could be referred to in a link from the TD. This implies that the TM is indicated by a URL (not a pure ID).

Personally, I think this is an example of putting too many assorted things into the links section of a TD ... that might be better in dedicated tags/sections ... maybe in another charter.

In my instantiation code, I additionally use/overuse the syntactic flexibility of the version tag (https://w3c.github.io/wot-thing-description/#version-serialization-json ) to also store the version of the TM there ... using the id of the TM as the key, leaving it to guesswork that the key is a TM used to instantiate and not some other version-relevant piece.

alexbrdn commented 7 months ago

If I see an instantiated TD, I want to be able to check its authenticity ... so I don't think leaving out the reference to the TM is a good thing (and I didn't get how it helps with privacy). Leaving the source in might also be needed for copyright / author's acknowledgement (of the entity authoring it, not the person unless so chosen).

We also need a way to verify integrity, i.e. that it hasn't been manipulated.

Including or leaving out the reference to the TM in TD is up to the producer of TD. I have made that remark to raise awareness that the id may include potentially privacy-relevant information, as noted in the standard.

Integrity verification is out of scope for this proposal and, in my opinion, is in no way hampered by the proposed identification scheme.

alexbrdn commented 7 months ago

Maybe some other terms from CreativeWork are more clear. https://schema.org/CreativeWork:

  • schema:publisher
  • schema:producer
  • schema:provider
  • schema:creator (same as author)

@a-hennig @alexbrdn what do you think?

I don't see much difference between producer and author for our purposes. Provider and publisher seem less fitting. I'd stick with author (or creator)

alexbrdn commented 7 months ago

@alexbrdn if the original id is actually a templated id, like currently suggested in the standard, do we also move it to a link relation?

In case it has the correct mandatory fields for the file being imported (same author, etc.), then no. Otherwise, yes.

alexbrdn commented 7 months ago

EDIT: updated proposal in light of the in-person discussion on Nov 6th. The largest change is that the version now builds the file name.

N.B. that as it currently stands, in the general case, an ID cannot be unambiguously parsed without knowing whether the TM is "official", i.e. authored by manufacturer, or not. For example, "omnicorp/senseall/temperature/v1.0.0-20231115134253-abcdef012345.tm.json" may mean

author_name = "omnicorp"
manufacturer_name = "senseall"
mpn = "temperature"
optional_path_parts = ""

or it may mean

author_name = "omnicorp"
manufacturer_name = "omnicorp"
mpn = "senseall"
optional_path_parts = "temperature"

Opinions on this change are very much welcome
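The ambiguity described above can be reproduced with a small sketch that enumerates every reading of an id permitted by the id schema. interpretations is a hypothetical helper, not part of the TMC.

```python
def interpretations(tm_id: str):
    *segments, _file_name = tm_id.split("/")
    readings = []
    if len(segments) >= 2:
        # "Official" TM: author_name omitted because it equals manufacturer_name.
        readings.append({
            "author_name": segments[0],
            "manufacturer_name": segments[0],
            "mpn": segments[1],
            "optional_path_parts": "/".join(segments[2:]),
        })
    if len(segments) >= 3:
        # Third-party TM: author_name is the first segment.
        readings.append({
            "author_name": segments[0],
            "manufacturer_name": segments[1],
            "mpn": segments[2],
            "optional_path_parts": "/".join(segments[3:]),
        })
    return readings


for reading in interpretations(
        "omnicorp/senseall/temperature/v1.0.0-20231115134253-abcdef012345.tm.json"):
    print(reading)
```

Any id with three or more path segments before the file name yields two readings, so a parser needs the identifying fields from the TM itself (or some marker in the id) to choose between them.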

alexbrdn commented 7 months ago

@daHaimi https://github.com/web-of-things-open-source/proposal/issues/10#issuecomment-1794974532

purl seems to be rather specific for software packages. I do not see how it can be bent to cover all our requirements.