WAF as a model provides semantic benefits, but implementation is entirely optional

petersilva commented 2 years ago

the current proposal uses a model of web accessible folders as:

an explanatory model,
a means of providing a trivial implementation
an efficient implementation option.

however: None of the above make the implementation of a WAF mandatory for any node participating in WIS2. It is an easily explained model, that's all.

Nominally, items in the WIS2 network are accessed using two fields:

baseURL ( #82) -- indicating the server and the base of the tree where the WAF begins
relPath ( #80 ) -- the relative path below the base where records are found.
In the various committees, people have agreed that topics are a useful concept, and building a topic hierarchy is a recognized activity being pursued.
Topic are being combined with authentication as a useful organizing principle for access control. That is permissions are being modeled as giving access to subtrees in the topic hierarchy.

Topics are semantically identical to file folders.

In order to access a record in any database, it must have an id, or a tag. We have the choice to make the tag relative to the dataset pointed to by the topic, or an absolute generic one. A topic relative id will be able to be much smaller in size than one that has to avoid collision with all possible products in all possible topics.

The relPath, therefore proves a minimal length record identifier for every item in each data collection. One can add additional metadata to the record identifier by making that id a meaningful name, or just have a random hash, but that is a separate discussion.

petersilva commented 2 years ago

An API allows someone (an front end user) to query and access information so long as they know the URL of the API without needing know any individual URIs (pygeoapi is relatively unique in that it generates a URI for each feature). On the other hand, a WAF allows someone (a scientist) direct access to 'physical' files with low overhead when accessing entire datasets or a specific URI. Any contributions to making the data in a WAF more explorable and discoverable is done by adding structure to the WAF. Some scenarios that come to mind:

A wis2box serving the topic hierarchies: data.core.observations-surface-land.mw.FWCL.landFixed and data.core.observations-surface-land.ca.FWCL.landFixed will have all information related to data.core.observations-surface-land... as subdirectories of it in the WAF. The API serves each topic hierarchy as a separate collection with no explicitly defined relationship beyond semantics.

petersilva commented 2 years ago

number that would need to be backed up by some experiments... thinking about big items, like RADAR, satellite imagery, nwp outputs, a raw data item on disk served by a web server consumes orders of magnitude less resources (cpu, memory) than satisfying the same request for the entire data item via an OGC-API request. They are different and complementary needs. if someone wants raw data, then it is orders of magnitude more efficient to provide the raw data than it is to reproduce the raw data via processing through an OGC stack.

That is my feeling... but I have no numbers to back it up. I have 2 servers providing 50 million hits/day of WAF, and an OGC-server that falls over and requires multiple ones at a small fraction of that load. But that is an anecdote.
Anyways. I need to emphasize that implementation of a WAF is easy and I think it is useful, but people are free not to implement WAF, but any proposal that prevents WAF is a real problem for me.

petersilva commented 2 years ago

The use of this model does not constrain use of web services, but explicitly allows the use of alternate url's by means of the retrieval one. but the relPath is still needed as the unique record identifier, and it contains the relative topic, which is what people need for data classification purposes.

one could argue that baseURL could be just combined into the retrieval URL, and not be a separate item, but there is no way to parse the baseURL out of the retrieval URL, so information is lost by that combination.
one could could argue that separating the "id" from the relative topic tree and have separate fields but:
- it is trivial to extract the "file part" of a URL using facilities in all programming languages.
- the id is in the name space provided by the topic hierarchy, so it makes no sense to take it out.

golfvert commented 2 years ago

Overall agreed. However using semantically meaningful elements (relPath) may have unnecessary consequences. So, my order of decision would be:

geojson or not ? If yes,
id and not relPath
geometry and not bbox

petersilva commented 2 years ago

see https://github.com/wmo-im/GTStoWIS2/issues/81 which achieves geoJSON compatibility with no other semantic changes.
id in all other usage I can see is non-hierarchical. Using id for for a hierarchical name is therefore confusing. we need to store topic somewhere.
ok... as in #81 I am approaching TT-protocols to give up on bbox and adopt geometry.

petersilva commented 2 years ago

I'd also like to point out that ID used within a single GeoJSON record and that using it as a index into a global database is abuse of geoJSON.

https://github.com/qgis/QGIS/issues/40805

using id that way is not at all consistent with normal GeoJSON.

golfvert commented 2 years ago

"id" here is not an index to a global database. It is "just" a unique identifier. This is is a uuid/guid as per RFC4122 From geojson RFC:

  o  If a Feature has a commonly used identifier, that identifier
      SHOULD be included as a member of the Feature object with the name
      "id", and the value of this member is either a JSON string or
      number.

petersilva commented 2 years ago

yes. the id is within a single GEOJSON file... so totally inappropriate for uniqueness beyond the scope of a single file.

golfvert commented 2 years ago

It is is for dedup of messages... I may miss something but I can't see how in a partially meshed setup, we can discard messages with re[t,l]Path and baseURL only. We need something (whatever it's name) to uniquely identify the message. However, the message for the same file issued by each cache must be different. With an "id" created by the producer/publisher of the file (hence the message) we can have this.

petersilva commented 2 years ago

When a message about a datum is received, it has a relPath and an integrity checksum. If you receive a second message with those fields the same. It is a duplicate to be discarded.

That's it.

golfvert commented 2 years ago

So, for the same "data" relPath should always be the same, but, based on issued time of the message the integrity checksum will be different. OK. It works but two things to check and not just same "id".

petersilva commented 2 years ago

Time is not involved with integrity check sum. So far integrity checksum is a sha512 digest of the data. Raoult mentioned a new concern about not wanting to produce a checksum, which can be addressed by additional integrity methods, but have not had the chance to raise that with tt-protocols yet.

petersilva commented 2 years ago

name
      "id", and the value of this member is either a JSON string or
      number.

id can be a number... does that sound like universal identifier to you? It is clear from the spec that it referring to a relative tag within a single file.

golfvert commented 2 years ago

Announcing an API (which by design can't have a checksum) means that we have to introduce yet another style of unique id. I am really failing to see the benefits for us (WIS2). I see WAF being more an unnecessary constraints (relPath...) that something useful. Sorry. E.g. using Links avoid having to glue baseURL with relPath (or retPath) to get the download details.

How with only relPath and checksum of the data, messages for the same data will be different for the origin Center and the X caches that we will have?

petersilva commented 2 years ago

Announcing an API

if you announce that a new radar volume scan, it is impossible to produce a checksum for it? Why? Announcements should be of specific products, not general availability. If the API call is the same regardless of time, then that is discovery metadata and just an OGC-API query. It does not make sense to announce that values are available at this end point, because always are there is no value to sending out the same API call over and over.

On the other hand, good practice with API design will mean that the API sent out in an annoucement will be a specific product (e.g. REST), so it will include some additional means to identify it as a unique product, in which case it does refer to a specific value, and in the vast majority of cases, calculating a checksum should not be a burden.

Please provide examples of API for which checksums cannot be produced, so we can understand the issue better.

links

Links can be constructed trivially by gluing two fields together, but that isn't reversible, there isn't any way to get the baseUrl back out. So making that change cripples a use case, vs. other uses being mildly inconvenienced.

A file can be modified. Newer versions of files have different checksums. so if you get a file with the same name and a different checksum then it should not be discarded, but instead the new version should be downloaded. Objects can have versions and identifying that you have the latest version would need a similar mechanism.

kaiwirt commented 2 years ago

I agree, that it is maybe good to support WAF and having the messages such that we can support different use cases.

However i object posing requirements on filenames or directory structures to nodes providing data. And it is a requirement for WIS2 to support APIs where the data is maybe generated only when the request is made. I can think of such cases where new raw data is arriving which needs to be announced via MQP and then products are generated from that raw data upon request.

So to sum up: Having messages to allow WAF: yes Requiring WIS Nodes to implement WAF: no

kaiwirt commented 2 years ago

A word on checksums: If we use SHA256 or SHA512 or any other cryptographic hash value, then this is sufficient to determine if data is unique. Collision resistance is one of the design principles for these kind of checksums. So we won't ever generate to different data sets having the same (cryptographic) checksum. At least not in realistic time horizons.

Maintaining a list of checksums a center already received in the last x days is enough for data deduplication.

petersilva commented 2 years ago

@kaiwirt yes agreeing with both points. however clarification:

People do want a topic hierarchy for many reasons.
to make the "id" field (regardless of whether it is automatically generated, or resplendent with metadata) should be relative to the data topic... limiting the id name space to being within each topic.

The above two are entirely equivalent to "relPath." That's all relPath was ever intended to be. Can you provide an example that you have in mind about products announced in MQP that are generated upon request? I bit more description to understand the case better would be helpful.

kaiwirt commented 2 years ago

I can think of a Web Map Service. You use MQP to announce, when new model data is available and end users use the service to download temperature plots.

Another use case might be that the raw data does not contain certain elements like QNH or perceived temperature, but the service calculates QNH from Temperature, Pressure, ... or calculates perceived temperature from Temperature, Wind, Humidity.

What i meant is that these products are not pre-computed but are computed by the Web Service when they are requested.

petersilva commented 2 years ago

The purpose of the checksum is for deduplication, and is meant to represent download of the entire, raw product. so if the announcement goes to two brokers and a downstream broker receives the message from both sides, it needs to understand if it is a duplicate or not. It does not actually need to be a data checksum, but ideally it would be something calculable so that different systems with the same datum derive the same value. That different methods are appropriate for different applications is the motivation for "integrity" having a "method" property, so that while sha512 is the default, the producer can choose other methods, as long as they are well-known (by producer and all consumers.)

In the Canadian stack, we have additional integrity methods implemented, which is just to use the name of the product, rather than a data checksum... which is used in some cases... in other cases, we have "arbitrary" where the application can define it's own opaque integrity value. These have not been brought to wmo in the past because the cases were not raised at tt-protocols.

petersilva commented 2 years ago

Bootstrap Problem

There is also the case of working with "foreign" data, not really foreign, but simply data that has not been classified yet. If we constrain the semantics to only deal with with a known topic hierarchy, and require strict conformance to it, then we cannot create a message for anything else.

If you receive a text file observation, and it has not been understood and annotated by an metadata rich system such as wis2box... you don't have enough information to create a message, so the message cannot be ingested in order to provide that metadata.

golfvert commented 2 years ago

I would say the exact opposite. We don't want "foreign" data without proper metadata to be exchanged using WIS2 solution. In the GTS2WIS (and WIS2GTS) aspect, we will work on this transition. However, I think that, by design, we want to impose a solution where metadata is a must to make data available on WIS2.

petersilva commented 2 years ago

Completely agree with "we don't want foreign data without proper metadata to be exchanged using WIS2" but how does any data get into WIS2 in the first place? In wis2box, we have surface obs. they are initially .csv files with no metadata. Current mqp solution creates messages for them so wis2box can ingest them, determine the necessary metadata, and then it is fit to travel further. If you require metadata before a message can be made then it needs all upstream sources must be able to create fully wis2 compliant messages. An FTP server, for example cannot produce WIS2 mqp messages... It makes transition much more difficult.

petersilva commented 2 years ago

further discussion here: https://github.com/wmo-im/wis2-notification-message

wmo-im / GTStoWIS2

WAF as a model provides semantic benefits, but implementation is entirely optional #83

Announcing an API

links

Bootstrap Problem