
proposal: thing-model-catalog: create a catalog of Thing Models #8

Open hadjian opened 12 months ago

hadjian commented 12 months ago

The Thing Model Catalog

The goal of this proposal is to motivate the development of a Thing Model Catalog. The catalog shall enable the collection of Thing Models for fixed-function devices, provided and curated by the community. This will enable users of the catalog to share the low-value effort of interfacing with devices that lack self-description in a standardized way.

Consumption

We target the use case of searching for Thing Models for devices to be onboarded to arbitrary WoT consumers. We identified four simple workflows:

  1. Manual Browsing: A system integrator searches the catalog manually and finds a matching Thing Model. The Thing Model is downloaded manually, Thing Descriptions are generated from the TM (see the sketch after this list) and supplied to a WoT consumer. The WoT consumer is now able to establish communication with the devices.

  2. WoT Consumer Integration: A WoT consumer that integrates with the catalog interface offers browsing, searching and filtering through a custom user interface. Via this user interface, the user selects a Thing Model from the query results, specifies replacement parameters to convert the Thing Model to Thing Descriptions and onboards devices to the WoT consumer.

  3. Bundling with a Consumer: A connectivity solution provides the Thing Model Catalog as an integrated part of its offering, i.e. it publishes a curated copy of the catalog with the solution.

  4. Programmatic Consumption: A connectivity solution offers autodiscovery of devices and searches the catalog for matching Thing Models to be loaded for instantiation.
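To illustrate the TM-to-TD conversion step in workflows 1 and 2: the TD 1.1 specification defines a {{PLACEHOLDER}} syntax for Thing Models, so in the simplest case the conversion is a textual substitution. A minimal sketch in Python (the placeholder names and the replacement map are hypothetical, and a full converter would also handle tm:ref and the TM-specific keywords):

```python
import json
import re

def tm_to_td(tm_text: str, replacements: dict) -> dict:
    """Fill every {{PLACEHOLDER}} in a Thing Model and parse the result."""
    def substitute(match: re.Match) -> str:
        return str(replacements[match.group(1)])  # KeyError = missing parameter
    return json.loads(re.sub(r"\{\{(\w+)\}\}", substitute, tm_text))

# Hypothetical TM snippet; real placeholder names depend on the Thing Model.
tm_text = '{"title": "{{DEVICE_NAME}}", "base": "modbus+tcp://{{IP}}:{{PORT}}"}'
td = tm_to_td(tm_text, {"DEVICE_NAME": "pump-42", "IP": "192.168.0.7", "PORT": 502})
```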

Contribution

To enable the contribution of Thing Models, the catalog needs to define a contribution interface. This interface needs to enable the following traits:

  1. Provenance: A catalog shall make the ownership and authority of contributions apparent. A catalog shall provide hosting infrastructure, but the hosting party shall not be responsible for all contributions; these need to be owned and managed by the contributors.

  2. Quality Gate: The interface shall act as a quality gate to enforce a minimum standard for contributed Thing Models, as there can be different levels of quality. The catalog shall communicate this quality level to consumers.

  3. Metadata: The catalog shall support other data in addition to the Thing Models, such as author information, rating, handbook, product image etc.

  4. Context Extensions: The catalog shall allow custom extensions to Thing Models, which are not checked for validity in a first release.

  5. Fast Feedback Loop: A contributor shall be able to run locally the same tests that a contribution must pass to be accepted by a remote catalog.

How

This section lists requirements that need to be addressed in an implementation without proposing a technical solution.

Storage Schema

The storage schema, e.g. the directory structure and file naming convention in the case of a file system, needs to allow for human browsability as well as programmatic processing.

The storage schema might allow for a contributor-designed hierarchical layout for their contributions.

The storage schema must allow grouping of related files, e.g. handbook, image, metadata file.
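A hypothetical layout satisfying these three requirements together could look like this (all names and the version folder are illustrative, not a proposed convention):

```
<namespace>/                     ownership boundary, e.g. a contributor or vendor name
  <manufacturer>/<model>/        contributor-designed hierarchy below the namespace
    v1.0.2/
      thing-model.tm.json        the Thing Model itself
      metadata.json              author information, quality stages passed, rating
      handbook.pdf               grouped related files
      product.png
```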

Identification

A user must be able to uniquely identify a Thing Model. As the Thing Description standard does not enforce a specific ID schema, the catalog must generate unique IDs for Thing Models.

Quality Stages

We currently see the following quality stages:

| Stage Name | Description |
| --- | --- |
| Syntactic Validation | TM validates against the official JSON Schema |
| Semantic Validation | JSON-LD shape is validated |
| Runtime Validation | TM was used to interact with a device and the device responses validate against the data schemas in the TM |
| Human Validation | Values returned from the device correspond to expected values, e.g. strings are decoded correctly |
| Connectivity Solution Compatibility | List of runtimes with which the TM was tested |
| Rating | Means of expressing perceived quality by users |
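As an illustration of the first stage, a minimal sketch in Python, assuming the `jsonschema` package and a locally stored copy of the official TM JSON Schema from the w3c/wot-thing-description repository (the file names are illustrative):

```python
import json
import jsonschema

# Local copies; the file names are illustrative.
with open("tm-json-schema-validation.json") as f:
    tm_schema = json.load(f)
with open("thing-model.tm.json") as f:
    tm = json.load(f)

try:
    jsonschema.validate(instance=tm, schema=tm_schema)
    print("Syntactic Validation: passed")
except jsonschema.ValidationError as e:
    print(f"Syntactic Validation: failed: {e.message}")
```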

Allow for Duplicates

A catalog must accept multiple Thing Models for the same device, as multiple users might author them independently. A consumer can decide which one to trust and can get a hint from the quality stages passed by the different Thing Models.

Access Restrictions

Contributions shall be owned by one user, who will have the exclusive right to change a Thing Model. The catalog shall enable proposed changes by other users to be accepted by the owning user.

Offline / Online Usage

It must be possible to use a snapshot of an online catalog in an offline scenario.

Cross Catalog Search

It must be possible to search across multiple catalogs to enable the concurrent use of a private and a public catalog.

Needs Discussion

cc => @hadjian @andrisciu

ondrejtomcik commented 12 months ago

Hi @hadjian. A few proposals on the content:

A system integrator searches the catalog manually and finds a matching TM. The TM is downloaded manually, TDs are generated from the TM and used to load the parameters into the connectivity solution and the communication to the device can be established.

There's potential for misunderstanding here. I suggest rephrasing it to: A system integrator accesses the WoT Consumer's user interface and searches for the appropriate TM using various filters, such as manufacturer name, product name, or other relevant identifiers. Upon finding a match - WoT Consumer getting the link to matching TDs from the catalogue, the WoT Consumer fetches the TM. After that the WoT Consumer prompts the user for the necessary information to convert the TM to the TD.

A connectivity solution provides the Thing Model Catalog as an integrated part of its offering, i.e. will publish a curated copy of the catalog with the solution.

The WoT Consumer can integrate a snapshot of the catalogue for offline search in case the solution is disconnected from the internet.

"A connectivity solution"

I would propose to use the WoT terminology here - WoT Consumer.

It must be possible to search across multiple repositories to enable the concurrent use of a private and a public repository.

Repository is a new term in this document. I think you wanted to use catalogue.

hadjian commented 11 months ago

Thx @ondrejtomcik for your input.

A system integrator searches the catalog manually and finds a matching TM. The TM is downloaded manually, TDs are generated from the TM and used to load the parameters into the connectivity solution and the communication to the device can be established.

There's potential for misunderstanding here. I suggest rephrasing it to: A system integrator accesses the WoT Consumer's user interface and searches for the appropriate TM using various filters, such as manufacturer name, product name, or other relevant identifiers. Upon finding a match - WoT Consumer getting the link to matching TDs from the catalogue, the WoT Consumer fetches the TM. After that the WoT Consumer prompts the user for the necessary information to convert the TM to the TD.

You are right that it doesn't seem to be clear, as I was thinking of a different scenario. And the one you are describing is actually missing. Scenario 1 is supposed to cover the case where there is no integration between the catalog and a WoT consumer, so a user needs to be able to browse the catalog manually, download a file, replace the parameters and load the result into the WoT consumer by whatever means the WoT consumer offers to load TDs.

I will add the integrated scenario and try to rephrase the first scenario.

A connectivity solution provides the Thing Model Catalog as an integrated part of its offering, i.e. will publish a curated copy of the catalog with the solution.

The WoT Consumer can integrate a snapshot of the catalogue for offline search in case the solution is disconnected from the internet.

I don't quite understand the difference from what I wrote. I meant that the WoT consumer will integrate a snapshot of the catalog when it is released, to be used in an offline scenario. So the dev team of the WoT consumer downloads the snapshot, curates it and bundles it with their consumer. The consumer can then use the catalog without internet access.

"A connectivity solution"

I would propose to use the WoT terminology here - WoT Consumer.

Agree.

It must be possible to search across multiple repositories to enable the concurrent use of a private and a public repository.

Repository is a new term in this document. I think you wanted to use catalogue.

Agree.

hadjian commented 11 months ago

@mkovatsc You are deep into that topic. Our plan is to allow context extensions and custom terms, but we will not validate them in a first release. Do you see any obstacle to integrating semantic validation in a future release?

hadjian commented 11 months ago

Added titles to the requirements in Contribution section.

hadjian commented 11 months ago

Notes for proposal review:

a-hennig commented 11 months ago

need to be more precise on the "semantic" quality stage.

semantics is transported via @type, which (even if it originates from json-ld) is defined via the wot jschema. i.e. the semantic stage should be split into

  • carries semantic information = filled @type, with namespace definition in the context if needed
  • carries semantic information in an industry-standard ontology = i.e. "is helpful", i.e. filled @type, referring to well-known ontologies like BRICK, eclass etc., not "only" a product-specific/local namespace
  • json-ld-loadable, i.e. can be loaded by a JSON-LD processor .. namespaces do not need to be resolvable
  • json-ld-fully-resolvable, i.e. certain JSON-LD / RDF operations can be performed, because all namespaces can be resolved and contain valid data

the first two points (semantics) are independent of the last two (json-ld) .. although it is hard to imagine that a "json-ld loadable" TM does not use "@type".

additionally: make sure to avoid any implied staging, e.g. "runtime requires semantic"

a-hennig commented 11 months ago

"mandatory metadata" .. I doubt we can make it really "mandatory" (i.e. reject contributions that do not have it), but we can "strongly recommend" it (e.g. by not including it in any search / index, only for direct download by full path)

a-hennig commented 11 months ago

case 4 should avoid the assumption that a user is still involved, e.g.

Programmatic Consumption: A connectivity solution offers autodiscovery of devices and searches the catalog for matching Thing Models. The matching Thing Models are offered to the WoT consumer to be selected for onboarding.

a-hennig commented 11 months ago

somehow, the consumer must be able to validate the authenticity of the TM (otherwise they would be perfect trojan horses)

Since the catalog does not take responsibility for the contributed TMs, it must offer ways to

  • identify the origin of a TM
  • provide a register of well-known authoring bodies (i.e. trusted contributors, known to the catalogue)
  • identify TM content - e.g. fingerprint method and relevant parts
  • define a way for validation of authenticity

this could use parts of the JWT/OIDC universe, e.g. X509 keys (kid), well-known keys (wkts) .. not sure if JOSE would help for fingerprinting or if we need to re-awaken the W3C discussion on TD signatures.

egekorkan commented 11 months ago

Some feedback:

daHaimi commented 11 months ago

I think the approach of using a git-based catalogue is a good one, as helm charts and gitOps strategies are also going that way.

I am however opposed to making a central URL mandatory (e.g. golang defaults to github but allows for alternatives). If defaulting to github while allowing for self-hosted catalogues is an option, I would opt for identification and URL resolution by some standard URI like purl (e.g. https://github.com/package-url/purl-spec).

It defines standard ways, e.g.

This also allows for local resolution supporting offline scenarios.
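To make that concrete, a purl-style identifier for a Thing Model could look as follows (a sketch only: "thingmodel" is not a registered purl type, and the resolution rule is invented for illustration):

```python
# Hypothetical purl-style identifiers for Thing Models; "thingmodel" is not
# a registered purl type and only illustrates the scheme.
tm_ids = [
    "pkg:thingmodel/github.com/wot-oss/acme/temperature-sensor@1.2.0",
    "pkg:thingmodel/catalog.example.com/acme/temperature-sensor@1.2.0",
]

def resolve(purl: str) -> str:
    """Map a purl to a fetch URL; a local mirror could be consulted first for offline use."""
    rest = purl.removeprefix("pkg:thingmodel/")
    path, _, version = rest.partition("@")
    host, _, name = path.partition("/")
    return f"https://{host}/{name}/{version}/thing-model.tm.json"

print(resolve(tm_ids[0]))
# -> https://github.com/wot-oss/acme/temperature-sensor/1.2.0/thing-model.tm.json
```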

egekorkan commented 11 months ago

@a-hennig wrote:

this could use parts of the JWT/OIDC universe, e.g. X509 keys (kid), well-known keys (wkts) .. not sure if JOSE would help for fingerprinting or if we need to re-awaken the W3C discussion on TD signatures.

TD Signatures discussion is awoken! It is one of the work items of the new charter :)

daHaimi commented 11 months ago

this could use parts of the JWT/OIDC universe, e.g. X509 keys (kid), well-known keys (wkts) .. not sure if JOSE would help for fingerprinting or if we need to re-awaken the W3C discussion on TD signatures.

When we are discussing defining a catalog and signatures alongside the catalog entries, we could also go in the direction of cosign, so we could store signatures alongside the TDs.

cosign, respectively sigstore, has also become a de-facto standard over the last years (being implemented by default in K8s management products like Rancher/Kubewarden), so this is my suggested signature + verification path.

hadjian commented 11 months ago

Accepted!

hadjian commented 11 months ago

need to be more precise on "semantic" quality stage.

semantics is transported via @type, which (even if it originates from json-ld) is defined via the wot jschema. i.e. the semantic stage should be split into

  • carries semantic information = filled @type, with namespace definition in the context if needed
  • carries semantic information in an industry-standard ontology = i.e. "is helpful", i.e. filled @type, referring to well-known ontologies like BRICK, eclass etc., not "only" a product-specific/local namespace
  • json-ld-loadable, i.e. can be loaded by a JSON-LD processor .. namespaces do not need to be resolvable
  • json-ld-fully-resolvable, i.e. certain JSON-LD / RDF operations can be performed, because all namespaces can be resolved and contain valid data

the first two points (semantics) are independent of the last two (json-ld) .. although it is hard to imagine that a "json-ld loadable" TM does not use "@type".

additionally: make sure to avoid any implied staging, e.g. "runtime requires semantic"

"mandatory metadata" .. I doubt we can make it really "mandatory" (i.e. reject contributions that do not have it), but we can "strongly recommend" it (e.g. by not including it in any search / index, only for direct download by full path)

We would like to have some human-readable directory structure, so the files should at least mention the manufacturer and model. If some minimum information is omitted, we will end up with a mess, i.e. a bunch of files only usable with an index and a search API. What do you think?

hadjian commented 11 months ago

need to be more precise on "semantic" quality stage.

semantics is transported via @type, which (even if it originates from json-ld) is defined via the wot jschema. i.e. the semantic stage should be split into

  • carries semantic information = filled @type, with namespace definition in the context if needed
  • carries semantic information in an industry-standard ontology = i.e. "is helpful", i.e. filled @type, referring to well-known ontologies like BRICK, eclass etc., not "only" a product-specific/local namespace
  • json-ld-loadable, i.e. can be loaded by a JSON-LD processor .. namespaces do not need to be resolvable
  • json-ld-fully-resolvable, i.e. certain JSON-LD / RDF operations can be performed, because all namespaces can be resolved and contain valid data

the first two points (semantics) are independent of the last two (json-ld) .. although it is hard to imagine that a "json-ld loadable" TM does not use "@type".

additionally: make sure to avoid any implied staging, e.g. "runtime requires semantic"

It is not specified in more detail, because we will omit this in the first implementation. What I have in mind in the future is to validate the JSON-LD structure, i.e. can all terms be expanded by a jsonld processor and can the shape be validated against some shacl. We will probably accept everything in the first step, then run it through a jsonld processor as a second step etc.
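A sketch of that second step, assuming Python with the `pyld` package and network access to fetch the remote @context (the SHACL shape validation would be a later, separate stage and is omitted here):

```python
import json
from pyld import jsonld

with open("thing-model.tm.json") as f:  # file name is illustrative
    tm = json.load(f)

try:
    # Expansion fetches and applies the @context and fails if the document
    # is not processable JSON-LD; the term IRIs themselves do not need to
    # be dereferenceable for this check to pass.
    expanded = jsonld.expand(tm)
    print(f"JSON-LD loadable: {len(expanded)} top-level node(s)")
except jsonld.JsonLdError as e:
    print(f"not processable as JSON-LD: {e}")
```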

hadjian commented 11 months ago

case 4 should avoid the assumption that a user is still involved, e.g.

Programmatic Consumption: A connectivity solution offers autodiscovery of devices and searches the catalog for matching Thing Models. The matching Thing Models are offered to the WoT consumer to be selected for onboarding.

If it is an automatic search, there might be multiple Thing Models for one device, right? How would one of the results be selected automatically?

hadjian commented 11 months ago

somehow, the consumer must be able to validate authenticity of the TM (otherwise they would be perfect trojan horses)

Agree. There is no final concept for this yet, but my mention of "Provenance" is aimed in this direction. When you look at golang or github, you have only namespacing as far as I can tell, e.g. https://github.com/nginx. It looks like nginx is running this namespace, but I have no guarantee. With docker there is some central authority validating the vendors. I think we should have namespaces as a first step, so you know that the owner of a namespace has authority. How to make these namespaces trustable is another story. If the catalog were only valid with one git forge, then we could mirror that namespace.

Since the catalog does not take responsibility for the contributed TMs, it must offer ways to

  • identify the origin of a TM

Namespace for now.

  • provide a register of well-known authoring bodies (i.e. trusted contributors, known to the catalogue)

Yes. Like dockerhub. Show me your ID!

  • identify TM content - e.g. fingerprint method and relevant parts

We plan to use a SHA hash (see the sketch at the end of this comment).

  • define a way for validation of authenticity - this could use parts of the JWT/OIDC universe, e.g. X509 keys (kid), well-known keys (wkts) .. not sure if JOSE would help for fingerprinting or if we need to re-awaken the W3C discussion on TD signatures.

Good point. This whole topic deserves its own issue. Let's get a first implementation ready without all these measures and discuss them in a separate issue. Thx for the feedback.
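A sketch of the fingerprinting idea mentioned above (assuming SHA-256 over a canonicalized JSON serialization; the exact canonicalization method is one of the open points):

```python
import hashlib
import json

def tm_fingerprint(tm: dict) -> str:
    # Sorted keys and fixed separators give a stable serialization,
    # so the hash does not depend on whitespace or key order.
    canonical = json.dumps(tm, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```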

hadjian commented 11 months ago

Some feedback:

It is probably wrong to name it WoT consumer, but it needs to be understandable by someone not familiar with the whole standards work. Let's leave it like this for this proposal; I think we should eventually revert to something like connectivity solution or protocol adapter.

Nice. We will check this out.

  • Human Validation is also a sort of semantic validation. It is about validating the meaning of the data. In one way, this can be covered in runtime if the data schemas are very precise.

Not sure about this. If a number is decoded in the wrong way, e.g. with the wrong byte ordering, then it is still a number; only it is the wrong one. When you know which number is currently being measured, then you know what to expect. Another example is an identifier stored on the device and decoded as a string. If you read some register with the wrong offset, it might still return a valid string, only not the right one. I think this sadly cannot be automated.
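To illustrate why this escapes automated checks: the same register bytes decode without error under either byte order, and both results validate against a plain numeric data schema (a small sketch):

```python
import struct

raw = bytes([0x41, 0x48, 0x00, 0x00])    # four bytes read from a device register
correct = struct.unpack(">f", raw)[0]    # big-endian IEEE 754: 12.5
wrong = struct.unpack("<f", raw)[0]      # little-endian: ~2.6e-41, still a valid float
# Both are numbers and pass {"type": "number"}; only a human who knows the
# expected reading can tell which decoding is right.
```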

a-hennig commented 11 months ago

case 4 should avoid the assumption that a user is still involved, e.g. Programmatic Consumption: A connectivity solution offers autodiscovery of devices and searches the catalog for matching Thing Models. The matching Thing Models are offered to the WoT consumer to be selected for onboarding.

If it is an automatic search, there might be multiple Thing Models for one device, right? How would one of the results be selected automatically?

well .. that would be the problem of the consumer. It could use non-interactive methods like "first wins" or "best grade/most trusted signature wins", or it could use local, interactive methods (buttons, command-line prompts, local UI, ...), with users not known to the catalogue (or without network or logon access to it).
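For illustration, a non-interactive "best grade wins" strategy could be as small as this sketch (the stage ranking and result fields are hypothetical, not part of any proposed API):

```python
# Hypothetical ranking of the proposal's quality stages, best last.
STAGE_RANK = {"syntactic": 1, "semantic": 2, "runtime": 3, "human": 4}

def pick_tm(results: list[dict]) -> dict:
    # Highest passed stage wins; ties fall back to the user rating.
    return max(results, key=lambda r: (max(STAGE_RANK[s] for s in r["stages"]),
                                       r.get("rating", 0)))

candidates = [
    {"id": "tm-a", "stages": ["syntactic"], "rating": 4.5},
    {"id": "tm-b", "stages": ["syntactic", "runtime"], "rating": 3.9},
]
print(pick_tm(candidates)["id"])  # tm-b
```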

a-hennig commented 11 months ago

need to be more precise on the "semantic" quality stage. semantics is transported via @type, which (even if it originates from json-ld) is defined via the wot jschema. i.e. the semantic stage should be split into

  • carries semantic information = filled @type, with namespace definition in the context if needed
  • carries semantic information in an industry-standard ontology = i.e. "is helpful", i.e. filled @type, referring to well-known ontologies like BRICK, eclass etc., not "only" a product-specific/local namespace
  • json-ld-loadable, i.e. can be loaded by a JSON-LD processor .. namespaces do not need to be resolvable
  • json-ld-fully-resolvable, i.e. certain JSON-LD / RDF operations can be performed, because all namespaces can be resolved and contain valid data

the first two points (semantics) are independent of the last two (json-ld) .. although it is hard to imagine that a "json-ld loadable" TM does not use "@type". additionally: make sure to avoid any implied staging, e.g. "runtime requires semantic"

It is not specified in more detail, because we will omit this in the first implementation. What I have in mind in the future is to validate the JSON-LD structure, i.e. can all terms be expanded by a jsonld processor and can the shape be validated against some shacl. We will probably accept everything in the first step, then run it through a jsonld processor as a second step etc.

if we already intend to implement it later (and it sounds like with rather specific tests), then we should take the time to specify it.

Your test sounds more like "fully resolvable json-ld, processable by shacl", which is far more than "semantics given, json-ld parseable/loadable" (my #1, #2, #3), because it also requires that all used namespaces are fully resolvable (#4) and processable by shacl (an additional #5) - and therefore (probably?) needs to be specified as shacl or with provided shacl rules (#6).

Your test would be one valid use and toolset of semantics, but not the only one, so we should not bake in assumptions (or should at least document them clearly as an option) ... and not leave them in a list that looks like one line building on another, i.e. where a later entry might suggest a mandatory earlier one.

I think we should group our grades into the 3 groups defined by https://www.w3.org/TR/wot-thing-description11/#validation-serialization-json (minimal, basic, full) and detail the sub-grades explicitly

It can do basic validation of extensions, specifically that the vocabulary used is defined. --> my #3, defined

In this case, context definition files and SHACL definitions can be used to validate additional assertions and check TDs for semantic consistency. --> my #4, resolvable

In addition, if context definitions and SHACL constraints for extension vocabularies can be fetched, then these can be used to validate extensions --> your #5, shacl constraint compliant

hadjian commented 11 months ago

case 4 should avoid the assumption that a user is still involved, e.g. Programmatic Consumption: A connectivity solution offers autodiscovery of devices and searches the catalog for matching Thing Models. The matching Thing Models are offered to the WoT consumer to be selected for onboarding.

If it is an automatic search, there might be multiple Thing Models for one device, right? How would one of the results be selected automatically?

well .. that would be the problem of the consumer. It could use non-interactive methods like "first wins" or "best grade/most trusted signature wins", or it could use local, interactive methods (buttons, command-line prompts, local UI, ...), with users not known to the catalogue (or without network or logon access to it).

Valid points. I'll remove the assumption of a manual selection from the scenario. Do you think it will have any impact on the implementation, i.e. would the API differ between the scenarios?

a-hennig commented 11 months ago

case 4 should avoid the assumption that a user is still involved, e.g. Programmatic Consumption: A connectivity solution offers autodiscovery of devices and searches the catalog for matching Thing Models. The matching Thing Models are offered to the WoT consumer to be selected for onboarding.

If it is an automatic search, there might be multiple Thing Models for one device, right? How would one of the results be selected automatically?

well .. that would be the problem of the consumer. It could use non-interactive methods like "first wins" or "best grade/most trusted signature wins", or it could use local, interactive methods (buttons, command-line prompts, local UI, ...), with users not known to the catalogue (or without network or logon access to it).

Valid points. I'll remove the assumption of a manual selection from the scenario. Do you think it will have any impact on the implementation, i.e. would the API differ between the scenarios?

no ... except that one has to be careful about assuming that sessions/tokens/browsers/redirects are there. Still the same API, but a bit more to test.