wmo-im / GTStoWIS2

Conversion of GTS headers to WIS2 topic
GNU General Public License v3.0

optimize topic tree 16: move deleted topic to "tags" to be used in filenames. #64

Closed petersilva closed 2 years ago

petersilva commented 3 years ago

Looking at all the proposed changes, if we ever want the mapping from AHL -> topic to be idempotent and reversible, we cannot just discard the strings we don't want in the topic tree. Here are two options:

for example, optimize topic tree 11: forecast/public --> forecast

we would add "public" as a tag.

so the name of the file under forecast would include "public" in it.

perhaps a comma separated list of tags would be an element of the new filename convention.
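A minimal sketch of the tag idea, in Python. The `ELIDED` table, the function names and the sample topic are hypothetical, made up for illustration; they are not taken from the GTStoWIS2 tables.

```python
# Hypothetical table: elements an optimization removes from the topic tree,
# keyed by the parent element (as in "forecast/public --> forecast").
ELIDED = {"forecast": {"public"}}


def optimize(topic):
    """Drop elided elements from a topic; return (short_topic, tags)."""
    kept, tags = [], []
    for element in topic.split("/"):
        if kept and element in ELIDED.get(kept[-1], set()):
            tags.append(element)  # keep the string as a tag instead of losing it
        else:
            kept.append(element)
    return "/".join(kept), ",".join(tags)


def restore(short_topic, tags):
    """Run the table "backwards": put each tag back after its parent element."""
    pending = [tag for tag in tags.split(",") if tag]
    full = []
    for element in short_topic.split("/"):
        full.append(element)
        for tag in list(pending):
            if tag in ELIDED.get(element, set()):
                full.append(tag)
                pending.remove(tag)
    return "/".join(full)


short, tags = optimize("surface/forecast/public/text")
print(short, tags)
print(restore(short, tags))
```

The comma-separated `tags` string is what would go into the file name; without it, `restore` has no way of knowing whether "public" was ever there.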

antje-s commented 3 years ago

For new products (without an AHL), we also need a way to map between

So the corresponding metadata record can be found via the filename, and the topic value should be contained in the metadata record (in the future). The pub/sub message contains the filename via relPath, so that mapping is possible from any direction:
filename --> metadata --> topic
topic --> message --> filename --> metadata
metadata --> topic --> message --> filename

petersilva commented 3 years ago

the colon (:) is an illegal filename character on windows... I guess when you say pid, you are using it to mean product identifier?

musing... I don't know what to do for file name delimiters... all the obvious ones are used... we are starting to get to the point where, like you did, you start doubling them up. I had double underscore, you had double colon. Perhaps using the Unicode character set available to us is more forward looking.

We could use UTF-8 for delimiter characters... as they are legal in file names... for example something like:
§ - Section Sign
¶ - Paragraph (Pilcrow?) sign
⁜ - dotted cross, an obelism... "Let it stand" ... not sure about this one.
※ - Reference mark ... seems pretty close to "metadata" in a single character.
▢ (unicode character for a white rectangle with rounded corners) could be used to denote a bounding box, for example.

or arrows... might be fun.... but hard to type...

https://en.wikipedia.org/wiki/List_of_Unicode_characters

https://en.wikipedia.org/wiki/List_of_Unicode_characters#Geometric_Shapes or not...

petersilva commented 3 years ago

could have a filename with sections...

mymodeloutputfilename__▢23,435,23,-3,-4,60__※metadatakey__⁜randomizer.grib

parsing would split on double underscore, and then look at the first character.... if you don't want a randomizer, leave that section out of the file name. As each section is tagged with an identifying lead character, it provides disambiguation.
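A rough sketch of that parsing in Python, assuming every section is introduced by a double underscore followed by its lead character. The section names in `SECTION_KINDS` and the function name are assumptions for illustration only.

```python
import os

# Lead characters suggested in this thread; the names on the right are assumed.
SECTION_KINDS = {
    "▢": "bounding_box",
    "※": "metadata_key",
    "⁜": "randomizer",
}


def parse_sections(filename):
    """Split a file name on "__" and label each section by its first character."""
    stem, extension = os.path.splitext(filename)
    parts = stem.split("__")
    parsed = {"name": parts[0], "extension": extension}
    for part in parts[1:]:
        if not part:
            continue  # tolerate an empty section
        kind = SECTION_KINDS.get(part[0])
        if kind is None:
            raise ValueError(f"unknown section lead character: {part[0]!r}")
        parsed[kind] = part[1:]
    return parsed


print(parse_sections(
    "mymodeloutputfilename__▢23,435,23,-3,-4,60__※metadatakey__⁜randomizer.grib"
))
```

Because each section carries its own lead character, the sections can appear in any order and any of them can be left out.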

antje-s commented 3 years ago

The double colon is from the metadata fileIdentifier field (product identifier). See [https://library.wmo.int/doc_num.php?explnum_id=10141] page 53

"The URI should be structured as follows:
• Fixed string “urn:x-wmo:md:”;
• Citation authority based on the Internet domain name of the data-provider organization, e.g. “int.wmo.wis”, “gov.noaa”, “edu.ucar.ncar”, “cn.gov.cma” or “uk.gov.metoffice”;
• Double separator colons: “::”;
• Unique identifier:
  – For metadata records describing GTS products in bulletins or named according to the WMO file-naming convention P-flag = “T” or P-flag = “A”, the unique identifier is “«TTAAii»«CCCC»”; ..."

So after "::" there should be a unique identifier which we could use in filenames (before the file extension and separated from the rest of the filename by a fixed character, e.g. "_").
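As an illustration, a small Python sketch that splits such an identifier and appends the part after "::" to a file name. The function names and the sample identifier are assumptions, built from the authority and TTAAii/CCCC values mentioned in this thread; they are not an existing API.

```python
PREFIX = "urn:x-wmo:md:"


def split_pid(pid):
    """Return (citation_authority, unique_identifier) from a metadata identifier."""
    if not pid.startswith(PREFIX):
        raise ValueError(f"identifier does not start with {PREFIX!r}")
    authority, separator, unique_id = pid[len(PREFIX):].partition("::")
    if not (authority and separator and unique_id):
        raise ValueError("expected <authority>::<unique identifier>")
    return authority, unique_id


def extend_filename(stem, extension, pid, delimiter="_"):
    """Append the unique identifier before the extension, after a fixed delimiter."""
    _, unique_id = split_pid(pid)
    return f"{stem}{delimiter}{unique_id}{extension}"


# Hypothetical identifier, for illustration only.
pid = "urn:x-wmo:md:int.wmo.wis::ISMD01EDZW"
print(split_pid(pid))
print(extend_filename("A_ISMD01EDZW231800CCG_C_EDZW_20210927120001", ".bin", pid))
```

Only the unique identifier ends up in the file name, so the colons of the URN never do.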

petersilva commented 3 years ago

The colon character is illegal in a file name on windows. I don't think a file name convention that does not work on Windows is a reasonable thing. Any file convention that has colons in it is likely unacceptable. Our last generation WMO file switch used colons extensively, and we have been engaged in decolonization (for a number of years.) A legal URL and a legal file name are not the same thing. Even in a URL, colons are most likely going to get percent encoded because they are used in the URL (scheme separator, and port separator) so are likely a complicated choice in practice.
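To make the point concrete, a short Python sketch. The reserved-character set is the one Windows documents for file names; the sample identifier and the helper name are assumptions for illustration.

```python
from urllib.parse import quote

# Characters Windows does not allow in a file name.
WINDOWS_RESERVED = set('<>:"/\\|?*')


def windows_illegal_chars(name):
    """Return the reserved characters that occur in a candidate file name."""
    return sorted(set(name) & WINDOWS_RESERVED)


# Hypothetical identifier, for illustration only.
pid = "urn:x-wmo:md:int.wmo.wis::ISMD01EDZW"

print(windows_illegal_chars(pid))
# With no characters marked as safe, every colon is percent-encoded as %3A.
print(quote(pid, safe=""))
```

So the same string is unusable as a Windows file name and becomes noticeably less readable once it is encoded for use inside a URL.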

antje-s commented 3 years ago

Filename for the above example would be A_ISMD01EDZW231800CCG_C_EDZW_20210927120001_23567003_ISMD01EDZW.bin so there are no colons in the filename. The double occurrence of TTAAii could be optimized. Your first suggestion above is very similar ("- we keep the AHL in the filename somewhere"), and the suggestion I made is just more general, to have a uniform solution for products with and without an AHL.
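A sketch of how such a filename could be taken apart again, in Python. The field names are my reading of the underscore-separated layout and are assumptions, as is treating the last field as the unique identifier; the meaning of the field before it is not stated in this thread, so it is kept as opaque "free" text.

```python
import re
from datetime import datetime

# AHL as it appears in the name: TTAAii, CCCC, YYGGgg, optional BBB.
AHL = re.compile(
    r"^(?P<ttaaii>[A-Z]{4}\d{2})(?P<cccc>[A-Z]{4})(?P<yygggg>\d{6})(?P<bbb>[A-Z]{3})?$"
)


def parse_filename(filename):
    """Split the name on "_" and pull the AHL parts out of the second field."""
    stem, _, extension = filename.rpartition(".")
    fields = stem.split("_")
    if len(fields) < 6:
        raise ValueError("expected at least six underscore-separated fields")
    pflag, product, oflag, originator, stamp = fields[:5]
    *free, unique_id = fields[5:]  # last field: unique identifier (assumed)
    ahl = AHL.match(product)
    return {
        "pflag": pflag,
        "ahl": ahl.groupdict() if ahl else None,
        "oflag": oflag,
        "originator": originator,
        "timestamp": datetime.strptime(stamp, "%Y%m%d%H%M%S"),
        "free": free,
        "unique_id": unique_id,
        "extension": extension,
    }


info = parse_filename(
    "A_ISMD01EDZW231800CCG_C_EDZW_20210927120001_23567003_ISMD01EDZW.bin"
)
print(info["ahl"], info["unique_id"])
```

For a product with an AHL, the unique identifier equals TTAAii plus CCCC from the second field, which is the duplication that could be optimized away; for a product without an AHL, only the trailing identifier would carry that information.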

kaiwirt commented 3 years ago

A few comments from me:

antje-s commented 3 years ago

My point was to only extend the filename with the substring after "::" from the pid

petersilva commented 3 years ago

In order to achieve the mapping @kaiwirt is asking for in his third comment just above, we need to keep all the strings that the optimization proposals are eliding somewhere. Hence this issue: a proposal to put the elided strings in some sort of tags attribute. Whatever we (non-redundantly) omit from the topic tree gets appended to a list of tags which are part of the file name. Those tags, coupled with the topic tree, should enable us to run the tables "backwards" to derive an AHL whenever that is possible. If we just drop the information, then a reverse mapping is not possible from the file name alone.

petersilva commented 2 years ago

I think the direction from ET-AT was quite clear. People want the API response as the model, and file names are not really on the menu. There is clearly no appetite for any kind of file naming standard.