wmo-im / GTStoWIS2

Conversion of GTS headers to WIS2 topic
GNU General Public License v3.0
8 stars 5 forks source link

ET-W2AT asking for new message format. #78

Closed petersilva closed 2 years ago

petersilva commented 2 years ago

Well, as per committee consensus, I asked for geographic and temporal extent advice from ET-AT, and they replied by asking us to use GeoJSON format outright with a heavy re-work of elements of the payload that changes semantics significantly.

https://wmo-teams.atlassian.net/wiki/spaces/WIS2/pages/322568193/2022-04-04+ET-W2AT+Meeting

The ET-AT suggestion reject and/or discards all work on message format by TT-protocols since it's inception, and makes conceptual and semantic changes whose effect are unknown, necessitating a return to design phase, and re-validation of all existing work. There are many issues raised:

All of which can be done, but requires investigation and validation, and will take some time to work through. A minimum impact alternative #81 which just adds fields to the current payload is far more straightforward, and could be done immediately.

We should put a link in to Jeremy's slides, I'm just going from memory from what was presented for now:

changes:

good:

observations:

  Python 3.10.4 (main, Apr  2 2022, 09:04:19) [GCC 11.2.0] on linux
  Type "help", "copyright", "credits" or "license" for more information.
  >>> import geojson
  >>> my_point = geojson.Point((43.24, -1.532))
  >>> my_point['relPath'] = 'the/path/of/interest/tome.txt'
  >>> dump = geojson.dumps( my_point )
  >>> dump 
'{"type": "Point", "coordinates": [43.24, -1.532], "relPath": "the/path/of/interest/tome.txt"}'
  >>> 

not clear why the rest of the reformatting is necessary if software that does not understand the messages, will still not understand it regardless of it's placement in "properties", and software that parses geojson will still understand the existing payload in so far as it will already be geoJSON, to the same degree that STAC is. I suppose someone will argue that such GEOJSON is "unusual" ... I wonder if, in this case, being unusual is an advantage, it that it is unlikely to be mistaken for ordinary GEOJSON.

Worries/Dangers:

petersilva commented 2 years ago

73 opened for the addition of version to the payload.

petersilva commented 2 years ago

80 opened for adding "topic" to payload.

petersilva commented 2 years ago

81 opened for minimal transformation to achieve ET-AT's ask.

antje-s commented 2 years ago

An extension with additional geo information and type should not be a problem from my point of view. The message size does not increase significantly (in default case) and if we comply with a general standard we have a clear advantage. However, we should possibly set a maximum message size so that we do not have problems in the long run and a basis to reject message with large sizes. An extension to include geo information may not be widely used at the moment, but offers potential for further development. A "wrapping" of our content values with "properties" to be more familiar to the geoJSON format is also not much change

{ "type": "Feature", "geometry": { "type": "Point", "coordinates": [102.0, 0.5] }, "properties": { "prop0": "value0" } }

Also here I think important is a limit of the total size, because after geoJSON there seems no limit is set for further properties.

A download link or a service url is core content, but I think that probably in the WMO world also the repository systems and their repository structures will differ and are not necessarily oriented to the topic structure, therefore a total value in one field or other names of fields that has to be concatenated seems to be ok. The original idea that only the baseUrl has to be adjusted to own values seems to be not practicable in the diverse environment of WMO Nodes. Since a split into baseUrl and relPath is currently used in the pilot, the question would be what reasons speak against keeping the two fields to be merged for link?

However, an id or a unique filename value is very helpful for possible error searches and quality checks. We should be able to clearly identify content among each other. If someone asks if content x has arrived, it must be possible to quickly check if it is there. As the storage structures could differ, there must be some kind of id that makes this possible.

petersilva commented 2 years ago

The unique filename (as represented by the relPath.. potentially renamed) is either:

The two are equivalent.

If only a complete Url is provided, the baseUrl cannot be extracted for it, so any use case that needs baseURL is broken.

petersilva commented 2 years ago

84 for more comprehensive means of addressing comments about Integrity.

petersilva commented 2 years ago

In summary: The ET-AT suggestion reject and/or discards all work on message format by TT-protocols since it's inception, and makes conceptual and semantic changes whose effect are unknown, necessitating a return to design phase, and re-validation of all existing work. There are many issues raised:

All of which can be done, but requires investigation and validation, and will take some time to work through. A minimum impact alternative #81 which just adds fields to the current payload is far more straightforward, and could be done immediately.

josusky commented 2 years ago

Yay, I almost thought that we will get bored.

petersilva commented 2 years ago

@antje-s agree just moving the fields around is not a problem... it's actually minor and already implemented (the feed from hpfx has v03 (tt-protocols original flavour messages) and v04 (geojsonized ones) available now.. just pick v04 as the topicPrefix and geoJSON goodness at your fingertips. geoJSON isn't the issue, it's the wholesale replacement of fields by other fields with different semantics, the banning of file names (which basically forces WAF to use the message id as the file name) that they don't want paths on WAF to be representative... So I guess we should be generating random trees with id's in them, as the base case.

josusky commented 2 years ago

I do not have access to https://wmo-teams.atlassian.net/wiki/spaces/WIS2/pages/322568193/2022-04-04+ET-W2AT+Meeting @petersilva This is a groundbreaking change and I do not understand the reasons. And I already see significant drawbacks. Who proposed this and why? Could we have a teleconference about it?

petersilva commented 2 years ago

There was a meeting in Geneva (I wasn't there) and the group decided this was a good idea. the notes are from there... I think the content of this issue accurately reflects their request. In subsequent ET-AT meetings I tried to explain the implications, but people seem fixated on "WAF is ugly", so so everything else I say seems to be noise to them. They claim breakage.. it was N vs. 1... so ... whatever...

I'm implementing it for folks to see, and they can judge based on the result of their recommendations. I don't support these changes at all... (note: geojson is fine... it's the rest) but I'd rather implement than argue.... It's a test feed... no harm done.

People will gradually re-discover that we originally had an optimal, minimal solution... or we'll end up with some baroque rube Goldberg thing... where they keep adding things because it doesn't work because they don't understand the semantics that were implicitly taken care of by the original... whatever... out of my control. I think people think I'm difficult now, so would prefer if others spoke.

As you can see. I'm piling up the issues, and we will certainly start talking about it at the next teleconference. with this new development the committee has a lot more work to do.

josusky commented 2 years ago

@petersilva The notes that I cannot access (but I have sent a request so maybe I will get access) ;-) Now, in parallel, I was chatting with Tom Kralidis. He pointed out that https://geojson.org/schema/Feature.json does allows "null" geometry. So

{
  "pubTime" : "20190120T045018.314854383",
  "baseUrl" : "https://localhost/data/20190120",
  "integrity": {"method": "sha512", "value": "A2KNxvks...S8qfSCw=="},
  "relPath" : "WIS/CA/CMC/UpperAir/04/UANT01_CWAO_200445___15103.txt",
  "size": 457,
  "content" : { "encoding": "utf-8", "value": "encoded bytes from the file" },
  "retPath" : "4Pubsub/92c557ef-d28e-4713-91af-2e2e7be6f8ab",
  "type"    : "Feature",
  "geometry": null,
  "properties": null
}

is a valid GeoJSON. This said, almost any JSON can be trivially converted into a valid GeoJSON, but that does not make things interoperable. I am going to do some analogies - It is almost like telling: "your software produces XML my software consumes XML we are interoperable". For some data, a CSV is much better than XML because CSV is much more limited and it hints that the data is tabular. Similarly, using our primitive "WMO mesh message-schema" makes things clearer than the use of a much more generic format. We can refer to https://geojson.org/schema/Geometry.json as I did in the issue081 (I will probably do one more alternative there) but we should not market it as GeoJSON because that will just confuse everybody.

petersilva commented 2 years ago

To be clear, I think using geojson as the format for mqp messages is fundamentally wrong, as geojson is a data format, and most people will reasonably interpret such information as meaning that one should post geojson data as mqp messages. One will then always have to have a second conversation about sizes, and feature limitations or "profiling" ... and at that point why bother calling it geojson at all?

antje-s commented 2 years ago

@antje-s agree just moving the fields around is not a problem... it's actually minor and already implemented (the feed from hpfx has v03 (tt-protocols original flavour messages) and v04 (geojsonized ones) available now.. just pick v04 as the topicPrefix and geoJSON goodness at your fingertips. geoJSON isn't the issue, it's the wholesale replacement of fields by other fields with different semantics, the banning of file names (which basically forces WAF to use the message id as the file name) that they don't want paths on WAF to be representative... So I guess we should be generating random trees with id's in them, as the base case.

@petersilva switched the client to v04 as a test and it works

petersilva commented 2 years ago

I was mistaken...btw... I believe the ET-AT proposal is against id being the file name as well, so some other random name must be generated. The Canadian feed implements random file names as well (different names from message-id's)

josusky commented 2 years ago

I have commented on some of the referenced issues where I had the impression that my view might be helpful. Here I would like to comment on the high-level topics:

  1. Do we want to make the data notifications in WIS2 to be valid GeoJSON documents? I do not think so. The format is very generic and has only 3 required properties and at least two of them can be "null" thus two GeoJSON documents can differ in all but one property. At the same time, in certain domains the GeoJSON is used for actual data and consequently specialists from those domains may get confused and assume that notification messages in WIS2 will contain specific things - either values or formats of the values, that they are used to from their specific GeoJSON sub-specifications (e.g., compare "time" and "display_time" in http://agora.ex.nii.ac.jp/digital-typhoon/geojson/wnp/201601.en.json with our format).

  2. Does it make sense to use "GeoJSON Geometry" in the data notifications? Yes, but.

    • If there is a reason (possibility) to assign a geometry to a notification, then it definitely should be something well known, that is https://geojson.org/schema/Geometry.json.
    • On the other hand, we shall make it clear that it is optional. There might be some rare data that does not have a clear association with a geographical location. But more importantly, in the transition period, we will republish a lot of data from legacy systems that do not provide geographical information explicitly - and its deduction from actual data could be time-consuming, unreliable, or both.
    • It seems to me that for some applications a bounding box (https://datatracker.ietf.org/doc/html/rfc7946#page-12) might be more natural - but I admit that it can be trivially transformed into a polygon so that is not so relevant.
    • Finally, wherever possible, the filtering should be done server-side, that is, if the data source/service provides data for several areas/locations, it is preferable that it publishes notifications about them under separate (sub-)topics, so that data consumers can subscribe to just the relevant (sub-)topic, as opposed to having just one topic for all and then filtering the notifications after reception at the client.
josusky commented 2 years ago

I have added some examples to https://github.com/wmo-im/GTStoWIS2/tree/issue078/message_format

I had troubles with expressing the nullability of the geometry in a way that would be accepted by https://www.jsonschemavalidator.net. I will check this later on. Moreover, I had issues with "date-time". That is a known problem that this type in JSON means RFC 3339 format that is a somewhat lame subset of ISO 8601 (not only it is longer, but it does not allow higher precision than milliseconds). In my opinion ISO standards are more, er, standard then RFC, but for now I have used the RFC format.

petersilva commented 2 years ago

in example 2, you use a templated url value, hierarchy, from properties:

petersilva commented 2 years ago

subsequent discussions with ET-AT, reflected here: https://wmoomm.sharepoint.com/:b:/s/wmocpdb/Ech78Pb1k_9Jvde0uB-Ir0MB4giLGIaMBjelE0sETNhYRw?e=0ehXK5 brief summary of updates:

As this changes the proposal substantially. likely further discussion with multiple different proposals at different points in the issue will be hard to follow. We will close this issue. in favour of #90 to discuss only the revised proposal.