wmo-im / GTStoWIS2

Conversion of GTS headers to WIS2 topic
GNU General Public License v3.0
File Name Convention #41

Closed petersilva closed 2 years ago

petersilva commented 3 years ago

So far we have not discussed file naming conventions. In order to produce a complete recommendation for WIS2 processing, we need to agree on that. So far:

There is an existing file naming convention for file transfers in the WMO 386 manual. The most expedient thing to do would be to permit these old names for now, and think about how new products would be added. On the other hand:

petersilva commented 3 years ago

Here is an unhelpful example of bulletin AHL:

FTCN31 CWAO 312100 AAA

TAF AMD CYWL 312109Z 3121/0107 25008KT 2SM -RA BR BKN005 OVC010 TEMPO

     3121/0101 6SM -RA BR BKN010 OVC025

    FM010100 24008KT P6SM BKN040 OVC080 TEMPO 0101/0105 4SM -RA BR

     SCT005 OVC020

    FM010500 30008KT P6SM SCT080

    RMK NXT FCST BY 010100Z=

So the bulletin header names CWAO (Canadian Central Weather Analysis Office) as the national authority emitting such products, but the site being forecast for is CYWL (Williams Lake, British Columbia), some 3500 km west and a little north of CWAO. Ideally, the AHL would include the forecast area as well, but FTCNs are special because the WMO requires bulletin collection, so we can have up to 50 sites under a single header (which includes the CCCC). The AHL does not have the information that would be most helpful to clients.

If we stopped doing bulletin collection, then the message could be transmitted as FTCN32 CYWL and be much more useful to end users. (Of course the day we do that, GTS routing tables will explode because every site will then have its own AHL: much more useful for end users... slight problem for GTS administrators.)
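
To make the point about the AHL concrete, here is a minimal sketch that splits the heading line of the example above into its fields. The field layout (TTAAii CCCC YYGGgg with an optional BBB) is the usual abbreviated heading form; the names `AHL` and `parse_ahl` are hypothetical, and the pattern only covers the plain alphabetic form seen in this bulletin.

```python
import re
from typing import NamedTuple, Optional


class AHL(NamedTuple):
    tt: str             # data type designators, e.g. "FT"
    aa: str             # geographical designators, e.g. "CN"
    ii: str             # bulletin number, e.g. "31"
    cccc: str           # originating centre, e.g. "CWAO"
    yygggg: str         # day of month, hour, minute, e.g. "312100"
    bbb: Optional[str]  # optional amendment/correction indicator, e.g. "AAA"


# Assumed pattern: letters for TT/AA/CCCC/BBB, digits for ii and YYGGgg.
AHL_RE = re.compile(
    r"^(?P<tt>[A-Z]{2})(?P<aa>[A-Z]{2})(?P<ii>\d{2})\s+"
    r"(?P<cccc>[A-Z]{4})\s+(?P<yygggg>\d{6})"
    r"(?:\s+(?P<bbb>[A-Z]{3}))?\s*$"
)


def parse_ahl(line: str) -> AHL:
    """Split one abbreviated heading line into its fields."""
    match = AHL_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a recognised AHL: {line!r}")
    return AHL(**match.groupdict())


ahl = parse_ahl("FTCN31 CWAO 312100 AAA")
print(ahl.cccc)  # the originating centre, not the forecast site
```

Nothing in the parsed heading mentions CYWL; the station only appears in the bulletin body, which is the limitation described above.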

kaiwirt commented 3 years ago

I would not change the meaning of CCCC in the bulletin headers: it is the center originating said bulletin.

Of course file names for WIS2 can be different.

I would second that we should not put metadata in the file name. In my opinion, file names can be anything as long as there is a link between the file and the corresponding WIS2 metadata. That way, information on the file and its content can be retrieved by looking up the metadata from the file name, and conversely, when I search for a product I end up with the metadata describing that product, which should then point to the files.

petersilva commented 3 years ago

for WIS2 only... yeah... I wasn't talking about turning off collection in GTS, would cause issues, but rather that on WIS2 we could have the option to present data in files un-collected, and it would be a lot more useful than the collected forms... a feature of WIS2.

@kaiwirt if we want a pointer to metadata, we could use the integrity checksum (from the body of the message) and put that in the file name. But that would mean an md record for every single occurrence, which is a problem, my metadata people tell me. They want to have a single md record cover an entire data set... and not have a new md record generated and circulated for every surface ob every hour (or every five minutes, as certain of our stations do now.) Assume we have some sort of key value that is used to look up metadata, ok. It needs to be provided by the data source, and it isn't present in the TAC, BUFR, or GRIB currently. So to get that key we then have to look up the TTAAii in a set of tables to generate that md key. Those are exactly the tables we have built in this repository. Would adding a key field to be supplied to md db's and our file names be helpful? How else do we get those values if the upstream source does not provide them (as with all current data formats)? We could start with an mdkey 'not-available' for all data and gradually fill in the gaps.

Practically speaking, I have a topic ... observations/surface/ca, and the directory is filled with essentially randomly named files. I have to look up the random names in a metadata database... I cannot select anything smaller than a country without reading the files, or performing md lookups. I don't think that is the best we can do.

I suspect a minimum of metadata needs to be in the file name:

so... something like:

_location_mdkey_versionOrRandomizer.format

20210901T1027__bbox=-141,41.7,-52.4,83.2_elev=12m__SomeArbitraryAssignedMdKey__8064d8dc1a1c71b014e0278b97e46187.grib

under: ca/canadian_met_centre/forecast/model/

20210901T1200+6hr__global__GEMHRDPS-NAMv34.5__8064d8dc1a1c71b014e0278b97e46187.grib

This would be the 6 hour forecast of a Canadian GEM model on a high resolution deterministic prediction system run, using version 34.5. The upstream source naturally has to provide the key "GEMHRDPS-NAMv34.5" as a metadata key... but now we have to standardize md lookup...

Looking at how CMC distributes grib files, I think they want variables in the file name as well, because people often only want certain variables from the model outputs. So I guess that could be part of the metadata lookup key. I think a meaningful metadata lookup key might be most helpful in this case, rather than an encoded pointer.

The valid time could include a decimal to go sub-second if need be, and we could have the minus sign separate start and end times to cover a range (for forecasts)... or + to allow things like +6h, +12h, +48h ...

I would ask Tom Kralidis and the metadata people how to provide that geo info. bbox is not ideal for all cases, but often is. For grids, for example, probably just a keyword "global", or harkening back to some spherical thing. Also vertical is missing. Anyways, this is just to get a flavour. Get time, location, datatype.

( https://gist.github.com/graydon/11198540 .. country bounding box source )
( bbox syntax from: https://wiki.openstreetmap.org/wiki/Bounding_Box )

The idea is to provide the maximum amount of information in the file name to help the user decide whether they want to download the file. If they have to read the file to understand whether it is relevant to them, then that is a form of failure. But agree that md in the file name should be minimum... nothing more than what is most useful to aid in selection prior to download.
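
As a rough illustration of how a consumer might take such a name apart without opening the file, here is a minimal sketch. It assumes four fields separated by a double underscore and a single trailing format extension, as in the two sample names above; `ProductName` and `parse_product_name` are hypothetical names, not part of any agreed convention.

```python
from typing import NamedTuple


class ProductName(NamedTuple):
    valid_time: str  # e.g. "20210901T1200+6hr"
    location: str    # e.g. "global" or "bbox=..._elev=12m"
    md_key: str      # metadata lookup key supplied by the producer
    version: str     # version, checksum or randomizer
    fmt: str         # file format extension, e.g. "grib"


def parse_product_name(filename: str) -> ProductName:
    """Split a '__'-separated file name into its assumed four fields."""
    # The md key may itself contain dots (v34.5), so only the last dot
    # is treated as the start of the format extension.
    stem, _, fmt = filename.rpartition(".")
    fields = stem.split("__")
    if not stem or len(fields) != 4:
        raise ValueError(f"expected 4 '__'-separated fields: {filename!r}")
    return ProductName(*fields, fmt)


name = parse_product_name(
    "20210901T1200+6hr__global__GEMHRDPS-NAMv34.5"
    "__8064d8dc1a1c71b014e0278b97e46187.grib"
)
print(name.valid_time, name.location, name.md_key, name.fmt)
```

A subscriber could then filter on `location` or `valid_time` before deciding whether to download anything.
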
petersilva commented 3 years ago

https://docs.opengeospatial.org/is/12-063r5/12-063r5.html interesting reading.

petersilva commented 3 years ago

idea that will get shot down: instead of __ as a separator, use a utf-8 Section character like: § taking advantage of utf-8 where it helps with clarity.

tomkralidis commented 3 years ago

Can we come up with a hierarchy of named geographies or spatial keywords? If the goal is to provide a high level idea of "where" the file is, we can achieve this without coordinates per se. This can be real administrative areas with various levels of nesting (i.e. canada.alberta.calgary) or synthetic keywords (wmo.ra1), which resolve to real geometries (precise or fuzzy). Lots of options for dereferencing.

petersilva commented 3 years ago

I guess we have been saying that we don't want metadata in our file names the entire time, and when I see things like canada.alberta.calgary, I see that as metadata, where the real data is a bounding box. It takes seconds to look up latlong on the internet these days, and people get it on their devices from GPS, and it is dead easy to compare numbers. I don't think people establishing a bounding box for the data they are interested in is a high bar. if we have to look it up, it's probably metadata.

I hate to be the guy to explain to someone that that isn't how Kanada is spelled... in any event, these subscriptions are likely to be used by services... not really people.. so the syntactic sugar is lost on software.

Still, there is clearly a need for a few named areas... global (global grids), northern hemisphere? ... perhaps countries, which are equivalent to the synthetic keywords. I mean observations at sea... or in the air... you could enable that with an "area=" keyword. That would provide the flexibility you want, and we just permit both to exist and see what practice gets adopted.

If we start with adopting say... point=lat,long and bbox=minlong,minlat,maxlong,maxlat, perhaps add elev=h, or x+y (for a range)... we could have an area= for named ones, such as global... Would that be something we could start with (we can easily look that stuff up for the 6000 or so CCCC's) to get some sample data... It's literally there in keywords... and practice can figure out what is best in ... um... practice.
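
A minimal sketch of how software might read those keywords, assuming they are joined with a single underscore inside the location field (as in the earlier bbox=..._elev=12m sample) and that a bare word such as global is shorthand for a named area. The function name `parse_location` and the output keys are hypothetical, and elevation ranges (x+y) are left out.

```python
def parse_location(field: str) -> dict:
    """Parse 'point=', 'bbox=', 'elev=' and 'area=' keywords from one field."""
    result = {}
    for part in field.split("_"):
        key, sep, value = part.partition("=")
        if not sep:
            # Bare word such as "global": treat it as a named area.
            result["area"] = part
        elif key == "point":
            lat, lon = (float(v) for v in value.split(","))
            result["point"] = (lat, lon)
        elif key == "bbox":
            box = tuple(float(v) for v in value.split(","))
            if len(box) != 4:
                raise ValueError(f"bbox needs 4 numbers: {value!r}")
            # Order as proposed: minlong, minlat, maxlong, maxlat.
            result["bbox"] = box
        elif key == "elev":
            # Accept an optional trailing "m" for metres.
            result["elev_m"] = float(value[:-1] if value.endswith("m") else value)
        elif key == "area":
            result["area"] = value
        else:
            raise ValueError(f"unknown keyword: {key!r}")
    return result


print(parse_location("bbox=-141,41.7,-52.4,83.2_elev=12m"))
print(parse_location("area=global"))
```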

In the end, the data producers would have to start producing the data with the information available.

antje-s commented 3 years ago

To the metadata link between file name and WIS2 metadata....

To the collections...

kaiwirt commented 3 years ago

We have been discussing something like product and product instance. So the metadata should be describing a product. Something like "observation from station xyz"

The specific observation in terms of data and referring to a specific time is an instance of the product. This instance is stored in a file.

The key point here is: how can I discover the relevant files (=instances) for a product (=metadata)? This could be with regular expressions for the file name(s) or by listing the MQP topic in the metadata. And the other question is: if I have a file (=instance), how can I find the corresponding metadata to learn more about the product? (Some constant key in the filename that is also part of the metadata could do.)

I agree with @petersilva to have a specific key in the metadata which is also part of the filename for all corresponding files much like a database index to collect related information from different tables.

Geographical location would be part of the product and goes to the metadata in my opinion.

My point of view is that the user should not decide which files to download based on the file names, but on the WIS Metadata. We have search interfaces for that, we have filters for that. All the relevant information about a product in human readable form is in the Metadata. Why duplicate that in the filename in sort of like an awkward clumsy lengthy pile of letters? Just use the metadata, find the link to the files and what you then download is (or should be) what you were looking for.

The huge show stopper for users can be the granularity of metadata. And the use case @petersilva pointed out where users want only parts of a product.

At TT-GISC we have been discussing, that we only want to have 1 metadata record for a model (in contrast to have metadata records for every parameter), thus we might need to also think of "flavors" of a product.

And we keep on thinking about files. But (the same) metadata format has to also be useful for web services. And there you basically don't have a file name to rely on.

petersilva commented 3 years ago

I think for web services to be brought into the discussion... people making web services need to think about naming the result of a web service query. What would you name the series of bytes that results from a web query? The pub/sub stuff is products produced... they are not a response to a query... so it is like picking things off an assembly line as it goes by, rather than asking for someone to build something bespoke.

A file can be viewed as a canned response to a web service... and the problem for consumers is understanding what the query was without having to read the file (and without having to even download it, because the file name will be in the message.)

petersilva commented 3 years ago

@antje-s on topic tree in file name... the topic tree, in the static file case is in the directory tree, so no need to put it in the file name as well. We would want to put information in the file name that:

  1. does not repeat what is in the topic tree.
  2. points to a metadata record for the item, without repeating it for the most part.
  3. reduces the number of users/frequency of having to look up stuff in an alternate data source in real-time.

So valid time meets all of the above. The geo information meets points 1 and 3. file type is only 1 and 3 for now. Literally any information we put in a file name is metadata. so point 2 argues for file names that are solely pointers to md records. one could argue that the .bufr file extension is metadata as well. I think there is some minimal information to provide with the file, and that some sort of easily read geo fits the bill.

I don't like the idea of having to look something up for every item in the stream... it's inelegant and slow, and it adds all sorts of dependencies: no-one can know whether your data is interesting without up-to-date metadata... The metadata network with OAI/PMH etc... and updating of that, becomes a pre-requisite for sending any data... To me, a lack of metadata should not hinder people from deciding on real-time flows. Deployment of metadata and data should not be intertwined. Of course, life is much better with rich metadata, but lack thereof should not prevent flow.

@kaiwirt : "My point of view is that the user should not decide which files to download based on the file names, but on the WIS Metadata." I agree, I just think people should consult the WIS Metadata ahead of time. Get the information they need once, and then program the flow so that it does not require real-time consultation of the metadata. To do otherwise means no-one can download anything unless they also have a full WIS metadata db installed, that operates at a speed that keeps up with the real-time flow. my current prototype flow, for example, moves at 300 files/second when there is a backlog. It requires engineering to get any db to keep up with such a rate, which is by no means maximal.

From talking with the TT-WISMD people, the way they have been describing topics... they will include that in the MD, which is fine. So you need to go to the WISMD to get parameters for any feed. To me the output of the WISMD query should include a geo tag that can then be applied against the file names, with no requirement for real-time lookup in another DB.

petersilva commented 3 years ago

Also, the subscription tables, if everything is just based on individual station ids, will be huge, because once you have picked a data type of interest (the topic) you are stuck enumerating every station individually (there won't be any commonality in the station names/md keys). So what a subscriber could easily express in a single statement (like a bbox) becomes a long list of particular stations.

petersilva commented 3 years ago

sample google URL from @tomkralidis

https://www.google.com/maps/search/montreal+metro+system/@45.509788,-73.6483943,13z/data=!3m1!4b1

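shows how google encodes lat,long and a third value in a URL (in this URL the third value, 13z, is a zoom level rather than an elevation). We could use a similar scheme, with elevation in the third position: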

@ - location
  followed by three comma separated numbers -> point, identical to above.
  followed by six numbers -> bounding box
  followed by a label instead of numbers -> area.
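
The three cases can be told apart mechanically. A minimal sketch, assuming the token starts with @ and that anything that does not parse as comma separated numbers is a label; `parse_at_location` is a hypothetical name, and the order of the six bounding box numbers is not defined in this thread, so they are returned uninterpreted.

```python
def parse_at_location(token: str):
    """Classify an '@...' location token as point, bbox or named area."""
    if not token.startswith("@") or len(token) == 1:
        raise ValueError(f"not a location token: {token!r}")
    body = token[1:]
    try:
        numbers = tuple(float(p) for p in body.split(","))
    except ValueError:
        # Not purely numeric: treat the whole body as an area label.
        return ("area", body)
    if len(numbers) == 3:
        return ("point", numbers)
    if len(numbers) == 6:
        return ("bbox", numbers)
    raise ValueError(f"expected 3 or 6 numbers, got {len(numbers)}")


print(parse_at_location("@45.509788,-73.6483943,13"))
print(parse_at_location("@global"))
```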

much more succinct way of expressing it.

kaiwirt commented 3 years ago

I would prefer, that metadata would become a prerequisite for data transfer. One current shortcoming of WIS is, that nobody cares for the metadata. If you would make metadata a prerequisite for data transfer then data producers would be forced to provide metadata and users would be happy because they would find metadata for data they receive.

For data transfer this would be a one time lookup. Find metadata, look up topic and broker, and then subscribe to the bespoke MQP service. And from then onwards you receive messages, extract content or URL and download data, without having to go back to the metadata. In that setting I don't see any advantages of having a specific file naming convention, as long as you have something that links the building blocks.

antje-s commented 3 years ago

...a specific key in the metadata which is also part of the filename for all corresponding files... --> sounds like a good way to implement a mapping in both directions

petersilva commented 3 years ago

agree on some kind of metadata key... I just want to also have the geo information as well... otherwise the selection is nearly useless... unless the metadata is super descriptive, all selections will just boil down to lists of station ids.... regex will be of no use.... perhaps I am jumping to conclusions... do you have some examples of what a useful metadata key would look like?

petersilva commented 3 years ago

there is the versioning problem... someone does a metadata lookup, they get the list of stations within the area they were looking for, they make their subscription. Two years later one third of the stations have been replaced. unless the user does a metadata lookup again, they will never know about new stations. How often should they do that? well ideally, the same time they look up the data, so we are back to real-time lookups of metadata.

If you have selection using geo information... unknown stations will show up in the feed, and the user can then go consult the metadata repository to find out about the new station. It's a discovery aid.

Similarly if stations are removed... the config will still uselessly list them until the user does a new query to WIS metadata.

kaiwirt commented 3 years ago
  1. A metadata key can be any id (alphanumeric string) that is listed in the metadata and that is part of all related file names
  2. I get your point on the geographic information and the versioning problem. In the old GTS, changes to the contents of data should be announced by updates to Volume C1 + NOXX GTS messages. The idea in WIS was, to have a similar procedure. So, before adding new stations to a file, someone should update the metadata and following that update some kind of procedure should be established to inform users of the changes. Sadly, this procedure has never been developed. However there is a (new) task in TT-GISC to move from Volume C1 + NOXX to metadata + update information. But i also see, that this is a manual procedure and will eventually not be perfect.

To me it is important though, that we get users to create good and up-to-date metadata

petersilva commented 2 years ago

https://stacspec.org/STAC-api.html#operation/postSearchSTAC provides an example bbox encoding:

"Example: The bounding box of the New Zealand Exclusive Economic Zone in WGS 84 (from 160.6°E to 170°W and from 55.95°S to 25.89°S) would be represented in JSON as [160.6, -55.95, -170, -25.89] and in a query as bbox=160.6,-55.95,-170,-25.89."

They also optionally allow 6-tuple to add meters elev.
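
The New Zealand example is useful because it crosses the antimeridian: the first longitude (160.6) is larger than the second (-170). A minimal sketch of a containment test that handles that case, assuming the 4-number order west, south, east, north used in the quoted example; `bbox_contains` is a hypothetical helper, not part of STAC.

```python
def bbox_contains(bbox, lon: float, lat: float) -> bool:
    """Return True if (lon, lat) lies inside a [west, south, east, north] box."""
    west, south, east, north = bbox
    if not (south <= lat <= north):
        return False
    if west <= east:
        # Ordinary box that does not cross the antimeridian.
        return west <= lon <= east
    # west > east: the box wraps around the 180 degree meridian.
    return lon >= west or lon <= east


nz_eez = [160.6, -55.95, -170, -25.89]
print(bbox_contains(nz_eez, 174.78, -41.29))   # a point near Wellington
print(bbox_contains(nz_eez, -176.5, -44.0))    # a point east of 180 degrees
print(bbox_contains(nz_eez, 151.2, -33.9))     # a point near Sydney
```

This is the kind of comparison a subscriber could apply to a bbox carried in a file name or message, instead of enumerating stations.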

petersilva commented 2 years ago

64 added based on recent submissions from Antje, and a desire not to lose the ability to convert back to GTS names.

antje-s commented 2 years ago

As planned in today's meeting, I will try to summarize my suggestion...

...starting with a current operational example: ISMD01EDZW --> metadata record includes as identifier (pid):

urn:x-wmo:md:int.wmo.wis::**ISMD01EDZW**

--> filename of an instance of the product is: A_**ISMD01EDZW**161200CCC_C_EDZW_20220216140303_76389172.bin

--> current topic value GTStoWIS2: de/offenbach_met_com_centre/observation/surface/land/fixed/synop/main/0-90n/90e-0

From the example you can see that without the component TTAAiiCCCC in the filename you would not easily find the metadata. Even if the topic values are included in the metadata, many products would be found as a result. This illustrates that mapping makes sense.

WMO-No.1060 includes in Appendix C in section 8: "For metadata records describing GTS products in bulletins or named according to the WMO file-naming convention P-flag = “T” or P-flag= “A”, the unique identifier is “«TTAAii»«CCCC»”;"

There is only one match for a fulltext search "ISMD01EDZW" in WIS Catalogue. Thus, if the position within the filename for TTAAiiCCCC is fixed, it is relatively easy to find the appropriate metadata record for a specific product instance with all other relevant information. If a unique part of the identifier value from the metadata is also integrated into the file name for new WIS2 products without AHL, this proposed solution should also be effective for them.

From my point of view, there is no reason why the filenames should not be structured according to the guidelines of the filename convention of WMO No. 386 Volume I in the future as well.

Filename format: pflag_productidentifier_oflag_originator_yyyyMMddhhmmss[_freeformat].type[.compression]

Then the unique part of the pid could be in the place of the productidentifier. Alternatively, for file names of products that do not conform to this form, but e.g. comply with manufacturer-specific specifications, the position directly before the file type ending could be used, separated from the rest of the name with a fixed separator, e.g. "_". But such filenames should not start the same way as the WMO pflag start sequence.

Hopefully you can understand my explanation of the proposed solution and it is not too confusing.

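
The mapping from file name to metadata identifier described above can be sketched in a few lines. This is only an illustration: it assumes the filename format quoted above, handles only pflag "A" (the case in the example, where the product identifier starts with TTAAiiCCCC), and uses the urn prefix from the example record; `WMO386_RE` and `metadata_pid` are hypothetical names.

```python
import re

# Assumed reading of:
# pflag_productidentifier_oflag_originator_yyyyMMddhhmmss[_freeformat].type[.compression]
WMO386_RE = re.compile(
    r"^(?P<pflag>[A-Z]{1,2})_(?P<product>[^_]+)_(?P<oflag>[A-Za-z])"
    r"_(?P<originator>[^_]+)_(?P<timestamp>\d{14})"
    r"(?:_(?P<free>[^.]*))?\.(?P<type>[^.]+)(?:\.(?P<compression>[^.]+))?$"
)

PID_PREFIX = "urn:x-wmo:md:int.wmo.wis::"  # prefix taken from the example record


def metadata_pid(filename: str) -> str:
    """Derive the metadata identifier from an 'A_' style file name."""
    match = WMO386_RE.match(filename)
    if match is None:
        raise ValueError(f"not in the expected filename format: {filename!r}")
    if match["pflag"] != "A":
        raise ValueError("this sketch only handles pflag 'A'")
    product = match["product"]
    if len(product) < 10:
        raise ValueError("product identifier too short for TTAAiiCCCC")
    # TTAAii (6 characters) followed by CCCC (4 characters).
    return PID_PREFIX + product[:10]


print(metadata_pid("A_ISMD01EDZW161200CCC_C_EDZW_20220216140303_76389172.bin"))
```

Because the key sits at a fixed position, the lookup needs no table and no access to the file content.
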
petersilva commented 2 years ago

76 GTS names included...

petersilva commented 2 years ago

ET W2AT was challenging. There is a strong current that no metadata should be in the file name.... well that means there is no point in filtering using file names... it all has to be done using message fields... which feels like a strange decision to me....

One objection was that... whatever we do... we are essentially inventing a micro-format... something new for everyone to parse. ok... so.. I guess we could just make the filenames totally opaque by making them base64 encoded subsets of the mqp message? throwing stuff at the wall at this point, just to see what it looks like...

If we assume we can add bounding box and valid time to the MQP payload, then the file name would contain: { ... "bbox": ..., "validTime": , "mimetype": ... }, base64 encoded. So the filename just becomes a reasonable subset of the mqp payload.

{"bbox":[-68.20,42.55,-51.33,53.54], "validTime":{"start":"20210901T1200","offset":"6h"},"mimetype":"application/x-wmo-bufr"}

you end up with a file name like:


eyJiYm94IjpbLTY4LjIwLDQyLjU1LC01MS4zMyw1My41NF0sICJ2YWxpZFRpbWUiOnsic3RhcnQiOiIyMDIxMDkwMVQxMjAwIiwib2Zmc2V0IjoiNmgifSwibWltZXR5cGUiOiJhcHBsaWNhdGlvbi94LXdtby1idWZyIn0KCg==.bufr

not obvious... the WAF would be filled with files like that... I'm not thrilled...
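
For anyone who wants to try this, a minimal sketch of the round trip. It uses the URL-safe base64 alphabet, which is an assumption on my part rather than something proposed in the thread: the standard alphabet includes "/", which cannot appear in a file name. `encode_name` and `decode_name` are hypothetical helpers.

```python
import base64
import json


def encode_name(fields: dict, extension: str) -> str:
    """Serialize a payload subset to compact JSON and base64 it into a name."""
    raw = json.dumps(fields, separators=(",", ":")).encode("utf-8")
    # URL-safe alphabet: uses '-' and '_' instead of '+' and '/'.
    return base64.urlsafe_b64encode(raw).decode("ascii") + "." + extension


def decode_name(filename: str) -> dict:
    """Recover the payload subset from a name produced by encode_name."""
    encoded, _, _extension = filename.rpartition(".")
    return json.loads(base64.urlsafe_b64decode(encoded))


fields = {
    "bbox": [-68.20, 42.55, -51.33, 53.54],
    "validTime": {"start": "20210901T1200", "offset": "6h"},
    "mimetype": "application/x-wmo-bufr",
}
name = encode_name(fields, "bufr")
print(len(name), name)
print(decode_name(name) == fields)
```

The name is machine-readable but, as noted, tells a human browsing the WAF nothing.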

Are there standards or conventions people are aware of for how to encode things in file names, to avoid inventing an idiosyncratic microformat, at least use one that others already use?

petersilva commented 2 years ago

could just use the raw json... except it won't work on windows because of the colon... replace : with some other unicode character...

{"bbox"§[-68.20,42.55,-51.33,53.54], "validTime"§{"start"§"20210901T1200","offset"§"6h"},"mimetype"§"application/x-wmo-bufr","productID"§"ISMD01EDZW"}

to parse you replace § with : throughout, and then feed it through a json decoder. That's readable at least...
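
A minimal sketch of that substitution, assuming § never occurs in the payload itself (the encoder refuses if it does). `to_name` and `from_name` are hypothetical helpers. Only the colon is handled here; other characters in the example, such as the double quotes or the "/" in the mimetype, would still be a problem on some file systems.

```python
import json

SUBSTITUTE = "\u00a7"  # the section sign, standing in for ':'


def to_name(fields: dict) -> str:
    """Serialize to compact JSON and swap every ':' for the substitute."""
    text = json.dumps(fields, separators=(",", ":"), ensure_ascii=False)
    if SUBSTITUTE in text:
        raise ValueError("payload already contains the substitute character")
    return text.replace(":", SUBSTITUTE)


def from_name(name: str) -> dict:
    """Reverse the substitution and decode the JSON."""
    return json.loads(name.replace(SUBSTITUTE, ":"))


fields = {
    "bbox": [-68.20, 42.55, -51.33, 53.54],
    "validTime": {"start": "20210901T1200", "offset": "6h"},
    "mimetype": "application/x-wmo-bufr",
    "productID": "ISMD01EDZW",
}
name = to_name(fields)
print(len(name), name)
print(from_name(name) == fields)
```

Printing the length makes it easy to see how quickly such names approach typical file name limits as fields are added.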

petersilva commented 2 years ago

for the bbox, just referring to https://datatracker.ietf.org/doc/html/rfc7946#section-5, the same way STAC does.

The problem with the JSON encoding is that most systems today have a 255 character limit on file names, and this tiny example is already 150 characters. I suspect that with very few additions, it will be too long.

petersilva commented 2 years ago

Looking for references... not finding much...

kaiwirt commented 2 years ago

My five cents are: The filenames can be whatever the data producer wants them to be. If CA wants to have bounding boxes or temporal in the filenames go ahead. If DE wants to have UUIDs in the filename go ahead. If UK wants to have whatever, go ahead.

I just would not build WIS2 such that it relies on filenames and file name conventions in order to work. The description of what's supposed to be in file a or b or c is in the discovery metadata and the topics.

In the same line: if someone decides to have WAF, go ahead. I just would not make it a requirement, because WIS2 is supposed to send the URLs to files out in the messages. Thus, at least for the real-time case, receiving the message is enough. I do not need to worry about the structure of the web server or the file names. I just fetch what's at that URL. And I receive URLs for data I am interested in (because I subscribed to the corresponding feeds in the first place).

petersilva commented 2 years ago

If the topic convention is specific enough then leveraging file names is unnecessary. If the filenames don't matter, because topics are good enough, then I agree that just getting the topic hierarchy agreed is sufficient. However: is that really the general case?

I would expect, for example:

For MQP users I hope we will get a consensus to put such geographical and temporal extent in the message payload, but WAF users will not be able to leverage that. Is WAF an important use case? I don't know.

To give an idea: in Canada, the WAF service provides about 10x the data (55 Mhits/day, > 4 TiB; I do not have current numbers, @tomkralidis might) as the OGC service, and on about 1/5 the hardware. So in this wildly unrepresentative sample, WAF:OGC is at about 50:1 in terms of bytes/$. WAF and OGC are complementary services: WAF+MQP is good for handling real-time feeds and web scraping. OGC is good for more structured sampling, and certainly supports much friendlier access methods, but is terrible for the WAF type access.

petersilva commented 2 years ago

I think the consensus is that no file name convention should be established.