wmo-im / WIS

WMO Information System
MIT License

CF-NetCDF experimental data exchange #1

Open efucile opened 3 years ago

efucile commented 3 years ago

WMO-CF profiles are entering an experimental phase. The aim of the experimental phase is to provide NWP centres and other users with the data in real time for testing, and to collect feedback to review the provisions that are going to be approved as part of the Manual on Codes. The simplest way to reach all the NWP centres and to test the WIS data exchange is to provide GTS headings for the data and exchange them on the GTS.

efucile commented 3 years ago

@wmo-im/et-data @wmo-im/et-om @remigiraud please comment on this

hermannasensio commented 3 years ago

a second way to exchange netcdf files in the WIS could be using the WMO file name convention (see Manual on GTS, WMO no 386, Attachment II-15) and a WIS-Metadata-set. Maybe an account at GISC Offenbach (or any other GISC) will be needed.

SimonElliottEUM commented 3 years ago

@hermannasensio @efucile We could do both at once, and put GTS bulletins with appropriate abbreviated headers inside files named according to the WMO file name convention and a WIS-Metadata-set

benjaminsaclier commented 3 years ago

The advantage of the WMO file name convention is that it does not use TTAAii; it would be a pity to 'corrupt' the netCDF file by adding a header inside that would have to be removed before processing anyway. We already exchange many WMO files on the GTS without headers (for example, satellite products).

efucile commented 3 years ago

a second way to exchange netcdf files in the WIS could be using the WMO file name convention (see Manual on GTS, WMO no 386, Attachment II-15) and a WIS-Metadata-set. Maybe an account at GISC Offenbach (or any other GISC) will be needed.

I like this proposal, but for a user it would be difficult to get all the netCDF data needed by digging through the catalogue

efucile commented 3 years ago

I would like to bring in the user point of view here. We need to set up the data exchange in such a way that, as a user, I can get all the data of a specific type in real time and without having to consult the WIS catalogue every day to see how much data I am missing. Usually we do this on the GTS by asking to have all the data matching a TTAAii pattern routed to us. Can we do this? @wmo-im/et-data @wmo-im/et-om @remygiraud

kaiwirt commented 3 years ago

I want to second Hermann's point. Eventually we have to stop using TTAAii headers. If we keep using TTAAii headers even for new experimental data exchanges, we won't make that necessary step towards WIS2 in the near future.

josusky commented 3 years ago

I would like to mention a parallel effort of the TT-Protocols to replace AHLs (TTAAii CCCC) with a hierarchy that can be interpreted either as a relative path or as a topic for exchange of notifications over pub/sub protocols (https://github.com/wmo-im/GTStoWIS2). The idea is that the data (product) would be identified by such a "path"; thus all products would form a logical tree from which data consumers could select sub-trees (or just leaves) as needed. This can also be seen as an evolution of the WMO file naming convention that brings more flexibility.
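A minimal sketch of the sub-tree selection idea. The path segments below are purely illustrative assumptions; the actual hierarchy is being defined in the wmo-im/GTStoWIS2 work, not here:

```python
# Sketch: selecting a sub-tree of a hypothetical topic/path hierarchy.
# Segment names ("observations/ocean/glider/...") are invented for
# illustration and are NOT the official GTStoWIS2 layout.

def select_subtree(topics, prefix):
    """Return all topic paths that fall under the given sub-tree prefix."""
    wanted = prefix.rstrip("/") + "/"
    return [t for t in topics if t.startswith(wanted) or t == prefix]

available = [
    "observations/ocean/glider/netcdf/ifremer",
    "observations/ocean/glider/netcdf/noaa",
    "observations/radar/australia/netcdf/bom",
    "model/global/grib2/ecmwf",
]

# "All glider data" and "all radar data from Australia" become simple
# sub-tree selections instead of lists of TTAAii headers.
gliders = select_subtree(available, "observations/ocean/glider")
radars = select_subtree(available, "observations/radar/australia")
```

The point of the tree structure is exactly this: a consumer's requirement maps to a single prefix rather than an enumeration of products.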

blchoy commented 3 years ago

I would like to mention a parallel effort of the TT-Protocols to replace AHLs (TTAAii CCCC) with a hierarchy that can be interpreted either as a relative path or as a topic for exchange of notifications over pub/sub protocols (https://github.com/wmo-im/GTStoWIS2). The idea is that the data (product) would be identified by such a "path"; thus all products would form a logical tree from which data consumers could select sub-trees (or just leaves) as needed. This can also be seen as an evolution of the WMO file naming convention that brings more flexibility.

I think the "path" is a sensible way of (uniquely) identifying objects in the data/product space. The only drawback I can see is that it could be difficult to change the hierarchy, once it grows to a certain size and complexity, without affecting users. Having said that, I don't think it is going to be a huge problem for us, as "our universe" is well described.

kaiwirt commented 3 years ago

Question from TT-GISC: is there an estimate for the data volume? High-volume, high-frequency data can be a problem on the (legacy) GTS.

efucile commented 3 years ago

Answering @kaiwirt on @hermannasensio's proposal and @josusky on the TT-Protocols discussion.

  1. We are developing WMO-CF profiles because we have a requirement to have these data available on WIS.
  2. We are planning to start experimentation this year, and the experimental data (which are going to become operational soon) have to be available to the wider community (including NWP centres) via WIS.
  3. The user requirement can be expressed with the following user stories. I am a user and I need to process in real time all the glider data in netCDF. I am a user and I need to process in real time all the radar data from Australia.
  4. If the proposal from @hermannasensio can satisfy the user stories, we need to document how, and the Secretariat will be ready to monitor the exchange.
  5. If not we need to have another solution. I have the impression that there is a mood of stopping GTS NOW, but we cannot do that. A feasible solution would be to have a demonstrator project where the data are provided with pub/sub protocols. The data will never be exchanged on GTS and will be the first dataset to be exchanged on WIS2. However, this will require allocation of resources for the data producers to develop a pub/sub platform. Details and availability from the producer side to be discussed.

tbuessel commented 3 years ago

I like this proposal, too!

a second way to exchange netcdf files in the WIS could be using the WMO file name convention (see Manual on GTS, WMO no 386, Attachment II-15) and a WIS-Metadata-set. Maybe an account at GISC Offenbach (or any other GISC) will be needed.

petersilva commented 3 years ago

constraints of note for exchanging such data over GTS:

  • the maximum size of any item on the GTS is 500,000 bytes. As segmentation was deprecated a decade ago, the GTS simply cannot transmit anything larger than that.
  • need to ensure such headers are forwarded as binary and not alphanumeric.
  • each intermediate RTH must be involved in routing between any two NCs; there is no such thing as point-to-point links on the GTS.

Transfer using the mechanisms described in wmo-im/GTStoWIS2 would allow:

  • no size limit: members just put files on a web or file server.
  • all data is "binary"; no need to differentiate in the transmission system.
  • each member advertises the products they have available. Direct peer-to-peer is possible, as well as more elaborate topologies with intermediaries.
  • yes, of course it is real-time.
  • avoiding over-taxing the GTS, which is kind of creaky and usually implemented on old equipment ill-prepared for high volume.

Assuming every member has a web server they can place files on, the main complication of the method is that each member originating data would need to install a RabbitMQ broker somewhere (it is free software, but it is a particular brand; the mature stacks are using it for now) as a means of transmitting real-time notifications of file availability to subscribers. The wmo-im/GTStoWIS2 team is actively vetting additional stacks as options, but for now RabbitMQ is the best choice.
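To make the notification idea concrete, here is a hedged sketch of what a file-availability message might look like. The field names (`pubTime`, `baseUrl`, `relPath`, `integrity`) follow the general shape of the GTStoWIS2 draft messages but are assumptions for illustration, not the official schema, and the server URL and checksum are placeholders:

```python
import json
from datetime import datetime, timezone

def make_notification(base_url, rel_path, size, checksum):
    """Build a file-availability notification to publish on a broker.
    Field names are assumed, not the official GTStoWIS2 schema."""
    return {
        "pubTime": datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ"),
        "baseUrl": base_url,   # which web server holds the file
        "relPath": rel_path,   # doubles as the topic hierarchy
        "size": size,          # lets subscribers budget downloads
        "integrity": {"method": "sha512", "value": checksum},
    }

msg = make_notification(
    "https://example-nc.int/data/",                        # hypothetical server
    "observations/ocean/glider/netcdf/sg610_20210120.nc",  # hypothetical path
    184320,
    "9f86d0…",                                             # truncated placeholder
)
payload = json.dumps(msg)  # body to publish, e.g. on a RabbitMQ exchange
```

The data itself never passes through the broker; subscribers receive only this small message and fetch the file from `baseUrl` + `relPath` themselves.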

efucile commented 3 years ago

constraints of note for exchanging such data over GTS:

  • the maximum size of any item on the GTS is 500,000 bytes. As segmentation was deprecated a decade ago, the GTS simply cannot transmit anything larger than that.
  • need to ensure such headers are forwarded as binary and not alphanumeric.
  • each intermediate RTH must be involved in routing between any two NCs; there is no such thing as point-to-point links on the GTS.

Transfer using the mechanisms described in wmo-im/GTStoWIS2 would allow:

  • no size limit: members just put files on a web or file server.
  • all data is "binary"; no need to differentiate in the transmission system.
  • each member advertises the products they have available. Direct peer-to-peer is possible, as well as more elaborate topologies with intermediaries.
  • yes, of course it is real-time.
  • avoiding over-taxing the GTS, which is kind of creaky and usually implemented on old equipment ill-prepared for high volume.

Assuming every member has a web server they can place files on, the main complication of the method is that each member originating data would need to install a RabbitMQ broker somewhere (it is free software, but it is a particular brand; the mature stacks are using it for now) as a means of transmitting real-time notifications of file availability to subscribers. The wmo-im/GTStoWIS2 team is actively vetting additional stacks as options, but for now RabbitMQ is the best choice.

Thank you, @petersilva this is a very useful summary.

benjaminsaclier commented 3 years ago

Are you sure about the 500,000-byte rule on the GTS? I cannot find anything in the Manual on the GTS regarding a maximum size of 500 K, and we are already receiving, for example, BUFR products > 500 K over the GTS.

SimonElliottEUM commented 3 years ago

See §2.7.1 (c) in Manual on the GTS: "The limit for meteorological bulletins for binary data representation or pictorial form shall be 500 000 octets"

benjaminsaclier commented 3 years ago

OK, thanks. In my understanding it is a constraint on bulletin size but not on file size. So if we use the general file naming convention for these data it should be OK. Anyway, I think it would be useful to update the Manual, because in "real life" many bulletins > 500 K have already been exchanged on the GTS for many years.

KenRJTD commented 3 years ago

And §2.7.1 (d) in the Manual on the GTS: "Sets of information may be exchanged using the file transfer technique described in Attachment II-15, particularly where sets larger than 250 000 octets are concerned."

petersilva commented 3 years ago

for clarity... I was referring to the GTS in terms of the socket protocol. File transfer is a different story, but all my experience with "GTS" transfers using files was just using the bulletin aggregation specification, so it inherited the limitations of the sockets. For real file data, we have always used bilateral arrangements. But my experience may be unique.

KenRJTD commented 3 years ago

Another aspect we should consider is GTS bandwidth. If the test data is too large, operational data (observations, NWP products, warnings, etc.) will be delayed. We need more information on this.

petersilva commented 3 years ago

um... if the broker is a barrier, I can investigate providing accounts on our public broker for experimental usage, to make it easier to get going. I imagine the example where Australia advertises files in Canada and Japan picks them up is a latency torture test, but functionally it should work. The actual data transfer would be direct; the broker only carries messages telling subscribers that a new product is available.

My guess is that this would traverse public internet connections, avoiding dedicated GTS ones. (However, our GTS runs over the public Internet, so for us it does not help.) The Canadian stack has a kbytes/s setting to limit bandwidth usage, with the obvious impact of delaying "real-time" transfers, but it could be used if a conflict with operational data arises. Many commodity downloaders (curl, wget) have similar settings.

blchoy commented 3 years ago

Assuming every member has a web server they can place files on, the main complication of the method would be the need for each member originating data to install a rabbitmq broker (It's free software, but it is a particular brand, as mature stacks are using that for now.) somewhere as a means of transmitting real-time notifications of file availability to subscribers. The wmo-im/GTStoWIS2 is actively working on vetting additional stacks as options, but for now rabbitmq is the best choice.

We are also looking at whether this is the way to distribute MET information for the aviation community over the ICAO SWIM (System Wide Information Management) environment. Even though a web server can handle a lot of simultaneous connections, the surge of incoming requests to the web server after broadcasting the availability of certain data/products through MQ may give rise to pulsating bursts of traffic, which need a special strategy to handle.

efucile commented 3 years ago

@wmo-im/et-om @tbuessel do we need a short meeting to discuss this and take a decision? Given the discussion, my proposal would be to make these data the first experimental data stream on the WIS2 MQP. In this way we will have the opportunity to test the new data standard and the new data exchange at the same time, and provide a live example of WIS2 in action. If we have enough consensus on this we can just decide here and avoid the meeting.

hermannasensio commented 3 years ago

@efucile I have now spoken with Thorsten; a short meeting is a good idea

efucile commented 3 years ago

@hermannasensio and @tbuessel I am going to organize a meeting to discuss only this topic.

tbuessel commented 3 years ago

Perfect and thank you very much!


petersilva commented 3 years ago

We are also seeing if this is also the way to distribute MET information for the aviation community over the ICAO SWIM (System Wide Information Management) environment. Even though a web server can handle a lot of simultaneous connections, the surge of incoming requests to the web server, after broadcasting the availability of certain data/product through MQ, may give rise to pulsating bursts of traffic which needs a special strategy to handle.

While I don't think we would set up a complete network for initial experiments, the GTStoWIS2 concept is that the messages separate the baseURL (identifying the server) from the relative path on the server. This means that you can have nodes that copy the data over to their own web servers and then forward the messages with the baseURL changed to their own server. This is actually the normal way I would expect people to want it to work. Example: Germany announces they have a new product, and the Russian server receives the notification over AMQP. The Russian WIS2 node downloads the product to its own server, and then advertises the product to Russian clients with the Russian server in the baseURL field. The message is otherwise unchanged.

One extension we are currently thinking about is providing multiple/redundant baseURL's so that a recipient can have multiple sources to download from.... but that is not quite ready yet.

This gives WIS2 nodes the choice of forwarding third-party messages unchanged, or creating a local cache and serving it, giving real-time independence to their downstream. It permits a wide variety of topologies and imposes none.
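The baseURL-swapping behaviour described above can be sketched in a few lines. The field names and server URLs are hypothetical, consistent only with the general GTStoWIS2 idea of separating the server from the relative path:

```python
def relay_message(message, local_base_url):
    """Re-advertise a third-party notification after caching the file
    locally: only baseUrl changes, everything else is untouched."""
    relayed = dict(message)          # shallow copy; original stays intact
    relayed["baseUrl"] = local_base_url
    return relayed

# Hypothetical notification as received from the originating centre:
origin = {
    "baseUrl": "https://opendata.dwd.de/",
    "relPath": "observations/radar/composite/rv_20210122.nc",
    "size": 512000,
}

# A downstream node copies the file to its own server, then forwards the
# message pointing at itself; the relative path is preserved verbatim.
local = relay_message(origin, "https://data.example-ru.int/")
download_url = local["baseUrl"].rstrip("/") + "/" + local["relPath"]
```

Because only `baseUrl` changes, subscribers can treat a cached copy and the original identically, which is what makes arbitrary topologies possible.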

golfvert commented 3 years ago

5. If not we need to have another solution. I have the impression that there is a mood of stopping GTS NOW, but we cannot do that. A feasible solution would be to have a demonstrator project where the data are provided with pub/sub protocols. The data will never be exchanged on GTS and will be the first dataset to be exchanged on WIS2. However, this will require allocation of resources for the data producers to develop a pub/sub platform. Details and availability from the producer side to be discussed.

I wouldn't go as far as saying we should stop the GTS now, but it would be a good opportunity to show what new solutions can bring. It is technically feasible to exchange netCDF on the GTS with the whole shebang (TTAAii, RTHs...); nevertheless, is it wise to "promote" the GTS like that when we want to transition to WIS2?

golfvert commented 3 years ago

We are also seeing if this is also the way to distribute MET information for the aviation community over the ICAO SWIM (System Wide Information Management) environment. Even though a web server can handle a lot of simultaneous connections, the surge of incoming requests to the web server, after broadcasting the availability of certain data/product through MQ, may give rise to pulsating bursts of traffic which needs a special strategy to handle.

@petersilva answered this with a technical solution. However, I am not sure it is even a real concern. I guess we all have general-public web servers that are used to handling these kinds of bursts. Considering WIS2 serves a large but (almost) closed community (Joe Public will not subscribe to brokers...), I think we are rather safe on that side. Am I too optimistic?


Note: this is off topic and I don't want to start a thread on this here.

golfvert commented 3 years ago

@RemiGiraud please comment on this

I am @remygiraud (with a y). I don't know the "i" guy :)

blchoy commented 3 years ago

Note: this is off topic and I don't want to start a thread on this here.

Agreed, as different communities have different use cases. Glad if I could be informed of the progress as we may also need to be involved in the possible WIS-to-SWIM development.

SimonElliottEUM commented 3 years ago

@remygiraud yours is a good question about the wisdom of promoting the GTS whilst working on the transition to WIS2. Nevertheless, we should also avoid a Pyrrhic victory: it would be wrong to jeopardize the exchange of WMO-CF profile data by insisting on the use of evolving technology. We need to keep a careful balance.

golfvert commented 3 years ago

@SimonElliottEUM I agree with you. I am hoping that we can do both. Promoting WIS2 and starting to exchange netcdf data "officially". Putting netcdf files on an FTP/HTTP/SFTP server and announcing the availability of data through pub/sub protocols could be a quick win.

jitsukoh commented 3 years ago

As @kaiwirt and @KenRJTD already pointed out, we need a documented plan for the experimental exchange, with reasonably realistic data volumes/frequencies and the range of users based on the purpose of use, separately for the glider and radar data. The user stories:

I am a user and I need to process in real time all the glider data in netCDF. I am a user and I need to process in real time all the radar data from Australia.

are a good start, but they are not enough to discuss the solution. If people start putting radar data of 30 GB/site/day onto the GTS without any planning, it would paralyze the GTS. For example, radar data needed for producing QPE/QPF and radar data for assimilation into global NWP are different use cases and can require different solutions.

And as a first step, we need format-level validation, as we always do for new BUFR/GRIB definitions, which can be done through GitHub.

golfvert commented 3 years ago

@jitsukoh I am hoping for a quick win here. So, of course, we need to agree on the format and make sure that there is no shortcoming in the definition. But, when that is done, I would strongly support having these files on standard servers, as explained above, and having our first (pre)operational pub/sub usage. Download over the Internet only, and no need for endless discussions about "should it be in the global cache of 15 GISCs" as we had a few years ago. AFAIK, we weren't able to achieve a practical solution for that, despite many meetings and many discussions. I really would like us to be more agile in that respect. I wrote in another email that the GTS is like a fragile grand-pa' that we need to protect from outside aggression. So, let's have some (S)FTP/HTTP(S) servers, coupled with pub/sub announcements. No need to get X copies of the files in GISC caches. So: data available and announced. And done. I know that I am (a bit) oversimplifying.

david-i-berry commented 3 years ago

This is at the edge of my knowledge but I thought it might be useful to post to give some context to our initial plans.

Based on discussions last week, the initial plan for the glider / autonomous surface vehicle (ASV) data is to make the data available via an ERDDAP server (e.g. https://ferret.pmel.noaa.gov/pmel/erddap/index.html) and then use pub/sub announcements to advertise the availability. When new data are added to an ERDDAP server, it is possible to trigger a subscription message via either an email or a URL; it should be possible to plug this into RabbitMQ, possibly via a middle layer that triggers the messaging.

I believe this workflow sits quite nicely with what is already in place for the glider / ASV data, which are already published via ERDDAP servers. Only minor modification will be required to make the NetCDF data conform with what we are defining, with the majority of the new work focused on setting up the RabbitMQ side of things.
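A rough sketch of what such a middle layer might do: translate an ERDDAP dataset-update trigger into a broker-ready notification. The dataset identifier, topic mapping, and server URL here are hypothetical assumptions, and ERDDAP's actual subscription callback carries its own parameters:

```python
# Sketch of the "middle layer" between an ERDDAP update trigger and a
# pub/sub broker. All names below (dataset id, topic layout, server)
# are invented for illustration.

def erddap_update_to_notification(dataset_id, erddap_base):
    """Translate an ERDDAP dataset-update event into a pub/sub message."""
    topic = "observations/ocean/glider/netcdf/" + dataset_id
    return {
        "baseUrl": erddap_base,
        # ERDDAP exposes tabledap datasets under a predictable path:
        "relPath": "erddap/tabledap/" + dataset_id + ".nc",
        "topic": topic,
    }

note = erddap_update_to_notification(
    "sg610_profiles",                       # hypothetical dataset id
    "https://ferret.pmel.noaa.gov/pmel/",   # ERDDAP base from the thread
)
# `note` would then be serialized and published, e.g. to RabbitMQ.
```

The attraction of this split is that ERDDAP keeps serving the data exactly as it does today; only the small translation shim is new.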

kevin-obrien commented 3 years ago

I would like to second what Dave wrote. As he mentioned, the ASV (Saildrone) data have been published through ERDDAP services as part of the Open-GTS project through GOOS/OCG. Once we finalize the NetCDF trajectory profile, we will update the data to conform to that format. If RabbitMQ is the message broker of choice, we can start looking at how to incorporate it.

golfvert commented 3 years ago

@kevin-obrien @DavidBerryNOC Strictly speaking, we are not going to recommend a particular broker implementation (RabbitMQ) but a set of acceptable pub/sub protocols (I guess a set can start with 1). However, as RabbitMQ implements the ones we are likely to choose (at least one of AMQP 1.0, AMQP 0.9.1, MQTT 3.0), it is a safe choice. @petersilva's team has drafted the format of the pub/sub message. This draft (like all drafts) might change, but if you start working in that direction, please check the format of the messages with Peter and his team.

Unfortunately, and to make a long story short, the situation with regard to pub/sub protocols is messy. The (probably) best one for our needs, AMQP 0.9.1, is not an ISO standard. AMQP 1.0 is, but it is fairly limited. EDIT: my claim that MQTT 3.0 isn't an ISO standard either was wrong. Something the WIS2 teams are looking at.
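For readers less familiar with the MQTT side of this protocol discussion, here is a minimal (and deliberately simplified) illustration of how MQTT topic filters would let a subscriber select sub-trees of a topic hierarchy. The topic strings are hypothetical, and this sketch omits some corner cases of the full MQTT specification (e.g. a `#` filter also matching its parent level):

```python
def mqtt_filter_matches(topic_filter, topic):
    """Simplified MQTT topic-filter match: '+' = exactly one level,
    '#' = this level and everything below it."""
    f_parts = topic_filter.split("/")
    t_parts = topic.split("/")
    for i, f in enumerate(f_parts):
        if f == "#":            # multi-level wildcard: rest always matches
            return True
        if i >= len(t_parts):   # filter is longer than the topic
            return False
        if f != "+" and f != t_parts[i]:
            return False
    return len(f_parts) == len(t_parts)

# "All radar data from Australia", whatever the producing site:
wants_oz_radar = mqtt_filter_matches(
    "observations/radar/australia/#",
    "observations/radar/australia/netcdf/bom",
)
```

AMQP 0.9.1 topic exchanges offer an equivalent facility with `*` and `#` wildcards, which is one reason either protocol family can carry the same hierarchical notification scheme.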

petersilva commented 3 years ago

fwiw, mqtt 3 is an ISO standard, and mqtt 5 was ratified, I think in 2019. It's just that amqp 0.9 has been used for many purposes for 10 years, and high-performance cases, in particular, have not been demonstrated with mqtt.

golfvert commented 3 years ago

My bad, you're correct. MQTT 3.1.1 is an ISO standard. MQTT 5 is too, but I haven't seen many implementations of it.

petersilva commented 3 years ago

mosquitto and emqx are two robust/popular MQTT implementations that purport to support v5. @antje-s and @josusky in my TT reported that they have done some testing with v5... but preliminary, yes.

kaiwirt commented 3 years ago

As pointed out earlier, we can prepare SFTP accounts to which producers could upload their netCDF files. Login credentials can be requested at wis AT dwd DOT de. If this is agreed, we would mirror the received files to our opendata.dwd.de system so that users can download all of the provided files from there. Additionally, we can implement a topic for this in our MQP pilot system.

kaiwirt commented 3 years ago

@DavidBerryNOC It is not yet decided whether WIS2 will use MQTT or AMQP. The tendency is toward AMQP, imo.