Closed petersilva closed 2 years ago
We also need for new products (without ahl) a possibility to map between
So the corresponding metadata record can be found via the filename and the topic value should be contained in metadata record (in the future). The pub/sub message contains the filename via relPath, so that mapping is possible from any direction:
filename --> metadata --> topic
topic --> message --> filename --> metadata
metadata --> topic --> message --> filename
The colon (:) is an illegal filename character on Windows... I guess when you say pid, you are using it to mean product identifier?
musing... I don't know what to do for file name delimiters... all the obvious ones are used... we are starting to get to the point where, like you did, you start doubling them up. I had double underscore, you had double colon. Perhaps using the Unicode character set available to us is more forward looking.
We could use UTF-8 for delimiter characters... as they are legal in file names... for example something like:
§ - Section Sign
¶ - Paragraph (Pilcrow?) sign
⁜ - dotted cross, an obelism... "Let it stand" ... not sure about this one.
※ - Reference mark ... seems pretty close to "metadata" in a single character.
▢ (unicode character for a white rectangle with rounded corners) could be used to denote a bounding box, for example.
or arrows... might be fun.... but hard to type...
https://en.wikipedia.org/wiki/List_of_Unicode_characters
https://en.wikipedia.org/wiki/List_of_Unicode_characters#Geometric_Shapes or not...
could have a filename with sections...
mymodeloutputfilename▢23,435,23,-3,-4,60※metadatakey__⁜randomizer.grib
parsing would split on double underscore, and then look at the first character.... if you don't want a randomizer, leave that section out of the file name. As each section is tagged with an identifying lead character, it provides disambiguation.
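A minimal sketch of that parsing in Python, for illustration only. It assumes every section is separated from its neighbours by a double underscore and that the extension is stripped first; the function name `parse_sections`, the `SECTION_TAGS` table, and the sample file name passed to it are hypothetical, not an agreed convention.

```python
import os

# Hypothetical mapping from lead character to section meaning.
SECTION_TAGS = {
    "▢": "bbox",        # bounding box
    "※": "metadata",    # metadata key
    "⁜": "randomizer",  # uniqueness suffix
}

def parse_sections(filename):
    """Split a file name on '__' and classify each section by its lead character."""
    stem, ext = os.path.splitext(filename)
    parts = stem.split("__")
    result = {"name": parts[0], "ext": ext}
    for part in parts[1:]:
        if not part:
            continue  # tolerate an empty section
        tag, value = part[0], part[1:]
        # Unknown lead characters are kept under the character itself.
        result[SECTION_TAGS.get(tag, tag)] = value
    return result

# Assumed layout: '__' between every section.
sample = "mymodeloutputfilename__▢23,435,23,-3,-4,60__※metadatakey__⁜randomizer.grib"
print(parse_sections(sample))
```

Because each section carries its own lead character, the order of sections does not matter and any of them can be left out.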
The double colon is from the metadata fileIdentifier field (product identifier). See https://library.wmo.int/doc_num.php?explnum_id=10141, page 53.
"The URI should be structured as follows:
• Fixed string “urn:x-wmo:md:”;
• Citation authority based on the Internet domain name of the data-provider organization, e.g. “int.wmo.wis”, “gov.noaa”, “edu.ucar.ncar”, “cn.gov.cma” or “uk.gov.metoffice”;
• Double separator colons: “::”;
• Unique identifier:
– For metadata records describing GTS products in bulletins or named according to the WMO file-naming convention P-flag = “T” or P-flag = “A”, the unique identifier is “«TTAAii»«CCCC»”; ..."
So after "::" there should be a unique identifier which we could use in filenames (before the file extension and separated from the rest of the filename by a fixed character, e.g. "_").
The colon character is illegal in a file name on Windows. I don't think a file name convention that does not work on Windows is a reasonable thing. Any file convention that has colons in it is likely unacceptable. Our last generation WMO file switch used colons extensively, and we have been engaged in decolonization for a number of years. A legal URL and a legal file name are not the same thing. Even in a URL, colons are most likely going to get percent-encoded because they are already used in the URL (scheme separator and port separator), so they are likely a complicated choice in practice.
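Both points can be checked quickly in Python. This is a sketch only: the identifier below is a hypothetical value built from the URI structure quoted earlier, `is_windows_safe` is an illustrative helper, and the check covers only the reserved characters (not reserved device names or trailing dots and spaces).

```python
from urllib.parse import quote

# Characters Windows does not allow in file names.
WINDOWS_RESERVED = set('<>:"/\\|?*')

def is_windows_safe(filename):
    """True if the name has no Windows-reserved or control characters."""
    return not any(ch in WINDOWS_RESERVED or ord(ch) < 32 for ch in filename)

# Hypothetical product identifier, for illustration only.
pid = "urn:x-wmo:md:int.wmo.wis::ISMD01EDZW"

print(is_windows_safe(pid))  # False: the colons are rejected
print(quote(pid))            # every ':' becomes '%3A' with the default safe set
```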
Filename for the above example would be A_ISMD01EDZW231800CCG_C_EDZW_20210927120001_23567003_ISMD01EDZW.bin so there are no colons in filename. The double occurrence of TTAAii could be optimized. Your first suggestion above is very similar "- we keep the AHL in the filename somewhere" and the suggestion I made is just more general to have a uniform solution for products with and without ahl.
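The suggestion of taking the part of the pid after "::" and placing it before the file extension could look roughly like this in Python. The function names, the "_" delimiter default, and the full pid value are assumptions made for the sketch; only the resulting file name comes from this thread.

```python
def unique_id_from_pid(pid):
    """Return the part of a product identifier after the '::' separator."""
    _prefix, sep, unique_id = pid.partition("::")
    if not sep or not unique_id:
        raise ValueError(f"no '::' separator or empty identifier in {pid!r}")
    return unique_id

def append_unique_id(filename, pid, delimiter="_"):
    """Insert the unique identifier before the file extension, if there is one."""
    uid = unique_id_from_pid(pid)
    stem, dot, ext = filename.rpartition(".")
    if not dot:  # no extension present
        return f"{filename}{delimiter}{uid}"
    return f"{stem}{delimiter}{uid}.{ext}"

# Hypothetical pid; the authority part is taken from the quoted examples.
pid = "urn:x-wmo:md:int.wmo.wis::ISMD01EDZW"
print(append_unique_id("A_ISMD01EDZW231800CCG_C_EDZW_20210927120001_23567003.bin", pid))
```

No colon reaches the file name, since everything up to and including "::" is dropped.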
A few comments from me:
I would strongly object to using UTF-8 special characters in file names. While this might work in modern file systems, it eventually breaks somewhere. My recommendation would be to stick with POSIX file name characters unless we have a good reason for something else. Additionally, file names will end up not only in the files, but also in URLs, MQP messages, etc. That said, are we sure that each system involved can handle UTF-8 file names?
I agree that we could have the pid of a product in the file name. We could encode colons as _ or use some other mapping. Apart from that, why not generate something like a UUID which can be added to metadata, file names, and messages?
Do we really want to put TTAAii in the file names just for mapping purposes? In my opinion, I would be happy if we could come up with something without AHL and have some procedure to map old and new where this is required.
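The first two points above can be illustrated with a small check against the POSIX portable filename character set (A-Z, a-z, 0-9, period, underscore, hyphen). This is a sketch; `is_portable` is a hypothetical helper name, and the sample names are only for illustration.

```python
import re
import uuid

# POSIX portable filename character set: A-Z a-z 0-9 . _ -
PORTABLE = re.compile(r"[A-Za-z0-9._-]+")

def is_portable(filename):
    """True if the name uses only portable characters and has no leading hyphen."""
    # POSIX also advises against a hyphen as the first character.
    return bool(PORTABLE.fullmatch(filename)) and not filename.startswith("-")

print(is_portable("A_ISMD01EDZW231800CCG_C_EDZW_20210927120001_23567003_ISMD01EDZW.bin"))
print(is_portable("name※key.grib"))    # a non-ASCII delimiter fails the check
print(is_portable(str(uuid.uuid4())))  # a UUID string is hex digits and hyphens
```

A generated UUID stays inside the portable set, so it could be carried unchanged in metadata, file names, and messages.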
My point was only to extend the filename with the substring after "::" from the pid.
In order to achieve the mapping @kaiwirt is asking for in his third comment just above, we need to keep all the strings that the optimization proposals are eliding somewhere... hence this issue, a proposal to put the elided strings in some sort of tags attribute. Whatever we (non-redundantly) omit from the topic tree gets appended to a list of tags which are part of the file name. Those tags, coupled with the topic tree, should enable us to run the tables "backwards" to derive an AHL whenever that is possible. If we just drop the information, then a reverse mapping is not possible from the file name alone.
I think the direction from ET-AT was quite clear. People want the API response as the model, and file names are not really on the menu. There is clearly no appetite for any kind of file naming standard.
Looking at all the changes proposed, if we ever want the mapping from AHL -> topic to be idempotent and reversible, we cannot just discard the strings we don't want in the topic tree. Here are two options:
for example, optimize topic tree 11: forecast/public --> forecast
we would add "public" as a tag.
so the name of the file under forecast would include "public" in it.
perhaps a comma separated list of tags would be an element of the new filename convention.
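A rough Python sketch of the tag idea, under several assumptions: the `OPTIMIZATIONS` table holds only the one example from this thread, and the function names and the file name layout (stem, underscore, comma-separated tags, extension) are hypothetical.

```python
# Hypothetical table: original topic path -> (shortened path, elided strings).
OPTIMIZATIONS = {
    "forecast/public": ("forecast", ["public"]),
}

def shorten(topic):
    """Return (shortened topic, tags) for a topic path; unknown paths pass through."""
    return OPTIMIZATIONS.get(topic, (topic, []))

def restore(short_topic, tags):
    """Run the table backwards: recover the original topic from topic plus tags."""
    for original, (short, elided) in OPTIMIZATIONS.items():
        if short == short_topic and elided == list(tags):
            return original
    return short_topic

def tagged_filename(stem, tags, ext):
    """Append a comma-separated tag list to the file name stem."""
    return f"{stem}_{','.join(tags)}{ext}" if tags else f"{stem}{ext}"

short, tags = shorten("forecast/public")
print(short, tags)                            # forecast ['public']
print(tagged_filename("product", tags, ".bin"))
print(restore(short, tags))                   # forecast/public
```

Without the tag, "forecast" alone could not be mapped back to "forecast/public", which is the reversibility concern raised above.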