wmo-im / iwxxm

XML schema and Schematron for aviation weather data exchange
https://old.wmo.int/wiswiki/tiki-index.php%3Fpage=TT-AvXML

Simplify IWXXM schemas #27

Closed braeckel closed 6 years ago

braeckel commented 7 years ago

There have been multiple comments over the past two years to the effect that IWXXM is unnecessarily complex, especially when compared to the original TAC messages. Some of this complexity is necessary and comes directly from the original TAC data, and some of it is unnecessary.

Reasons to simplify IWXXM include:

Reasons not to simplify:

O&M structure and its mismatch with IWXXM reporting constructs, the use of many dependent schemas (GML, sampling spatial, METCE, etc.), and heavily nested content all may be contributing to the sense that IWXXM is unnecessarily complex.

Removing the use of GML is one simplification option, but due to the many other advantages of GML it is not considered a feasible simplification alternative.

braeckel commented 6 years ago

Another simplification technique would be to collapse "deep" structures such as AerodromeRunwayVisualRange and AerodromeHorizontalVisibility to the top level. Presumably this would be augmented with Schematron co-occurrence rules.
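To illustrate what a co-occurrence rule has to do once nested content is flattened, here is a minimal Python sketch. It is not Schematron and not part of any IWXXM release: the element names (`record`, `meanWindSpeed`, `meanWindDirection`, `windGustSpeed`) and the rules themselves are hypothetical, chosen only to show the kind of "A may appear only if B appears" check that a nested type would otherwise enforce structurally.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules: each key may only appear when all listed companions appear too.
REQUIRES = {
    "windGustSpeed": ["meanWindSpeed"],
    "meanWindSpeed": ["meanWindDirection"],
}


def co_occurrence_errors(record):
    """Return one message per violated rule among the direct children of `record`."""
    present = {child.tag for child in record}
    errors = []
    for name, companions in REQUIRES.items():
        if name not in present:
            continue
        for companion in companions:
            if companion not in present:
                errors.append(f"{name} requires {companion}")
    return errors


# Hypothetical flattened records, not taken from the attached examples.
ok = ET.fromstring(
    "<record><meanWindDirection>200</meanWindDirection>"
    "<meanWindSpeed>10</meanWindSpeed></record>"
)
bad = ET.fromstring("<record><windGustSpeed>25</windGustSpeed></record>")

print(co_occurrence_errors(ok))
print(co_occurrence_errors(bad))
```

In a real schema these checks would presumably live in Schematron assertions alongside the XSD; the sketch only shows the logic they would need to express.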

braeckel commented 6 years ago

Here are some data points related to simplifying by removing O&M (.noOM. files) and by getting to the simplest form I could envision short of removing GML (.simplest. files). These simplified examples are not perfect and I suspect there are minor mistakes and inconsistencies, but for the purpose of comparing approaches they should suffice. All of these examples are based upon IWXXM 2.1.0; the ".xml" files without a sub-extension are the original examples from the release.

I wrote a simple utility to collect metrics on each file and to aggregate numbers for comparison. The avg depth of elements in a file is how deeply nested the elements are, on average, where the root element has depth 1 and a deeply nested element could have a depth of 7 or more. # elem is how many elements were in the file. # ns (number of namespaces) is how many namespaces are used, which is a proxy metric for the complexity imposed by pulling in other schemas as dependencies. Bytes no whitespace is the size of the file after unnecessary whitespace has been removed, and bytes w/whitespace is the size including the whitespace exactly as it is in the attached examples. Bytes GZIPed is the GZIPed size of the original example with whitespace. The DOM load time is how long, in nanoseconds, the library takes to read the example into memory as DOM objects. This last metric varies from run to run and should be considered as context rather than fact.
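The utility itself is not attached to this comment. As a rough idea of how such metrics could be collected, here is a minimal Python sketch using only the standard library. Everything in it is an assumption rather than the original tool: the function name `xml_metrics`, counting namespaces from element and attribute names, and treating only whitespace between tags as "unnecessary". Its numbers will therefore not necessarily match the ones reported here.

```python
import gzip
import re
import time
import xml.dom.minidom
import xml.etree.ElementTree as ET


def xml_metrics(data: bytes) -> dict:
    """Collect rough complexity metrics for one XML document given as bytes."""
    depths = []         # depth of every element, root = 1
    namespaces = set()  # namespace URIs seen on element and attribute names

    def note_namespace(qname: str) -> None:
        # ElementTree writes qualified names as "{namespace-uri}local".
        if qname.startswith("{"):
            namespaces.add(qname[1:].split("}", 1)[0])

    def walk(elem: ET.Element, depth: int) -> None:
        depths.append(depth)
        note_namespace(elem.tag)
        for attr_name in elem.attrib:
            note_namespace(attr_name)
        for child in elem:
            walk(child, depth + 1)

    walk(ET.fromstring(data), 1)

    # Approximation: only whitespace between tags counts as "unnecessary".
    compact = re.sub(rb">\s+<", b"><", data.strip())

    # Time a DOM load; this varies from run to run.
    start = time.perf_counter_ns()
    xml.dom.minidom.parseString(data)
    dom_load_ns = time.perf_counter_ns() - start

    return {
        "avg_depth": sum(depths) / len(depths),
        "elements": len(depths),
        "namespaces": len(namespaces),
        "lines": len(data.splitlines()),
        "bytes_no_whitespace": len(compact),
        "bytes_with_whitespace": len(data),
        "bytes_gzipped": len(gzip.compress(data)),
        "dom_load_ns": dom_load_ns,
    }


# Toy document for illustration only; not one of the attached examples.
sample = b"""<a xmlns="urn:example:a" xmlns:b="urn:example:b">
  <b:c>
    <b:d b:attr="1"/>
  </b:c>
  <e/>
</a>"""
metrics = xml_metrics(sample)
print(metrics)
```

To run it on a real file, pass the file contents read in binary mode, for example `xml_metrics(open(path, "rb").read())`.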

The file-specific numbers are as follows:

| file | avg depth | # elem | # ns | # lines | bytes no whitespace | bytes w/whitespace | bytes GZIPed | DOM load time (ns) |
|---|---|---|---|---|---|---|---|---|
| metar-A3-1.xml | 5.24 | 94 | 7 | 154 | 6708 | 9707 | 1882 | 2872990 |
| metar-A3-1.noOM.xml | 4.39 | 72 | 3 | 119 | 4695 | 6665 | 1426 | 2575565 |
| metar-A3-1.simplest.xml | 4.26 | 62 | 3 | 101 | 4153 | 5785 | 1285 | 2010746 |
| sigmet-A6-1a-TS.xml | 5.34 | 67 | 7 | 115 | 4403 | 6493 | 1633 | 1673078 |
| sigmet-A6-1a-TS.noOM.xml | 4.27 | 48 | 3 | 86 | 3128 | 4545 | 1194 | 1307594 |
| sigmet-A6-1a-TS.simplest.xml | 4.32 | 31 | 3 | 61 | 2475 | 3503 | 1035 | 1074463 |
| taf-A5-1.xml | 5.09 | 129 | 7 | 202 | 8618 | 12619 | 1795 | 1635994 |
| taf-A5-1.noOM.xml | 5.15 | 99 | 2 | 163 | 6450 | 9648 | 1303 | 1447112 |
| taf-A5-1.simplest.xml | 4.47 | 80 | 2 | 130 | 5208 | 7361 | 1160 | 3759243 |
| tc-advisory-A2-2.xml | 3.95 | 120 | 7 | 208 | 7884 | 11234 | 1577 | 1465392 |
| tc-advisory-A2-2.noOM.xml | 3.40 | 81 | 3 | 136 | 5396 | 7194 | 1157 | 1115489 |
| tc-advisory-A2-2.simplest.xml | 3.33 | 72 | 2 | 123 | 5027 | 6632 | 1038 | 1028251 |
| va-advisory-A2-1.xml | 6.32 | 158 | 7 | 256 | 8867 | 12563 | 2112 | 1127183 |
| va-advisory-A2-1.noOM.xml | 6.82 | 134 | 4 | 225 | 7388 | 10807 | 1794 | 1155570 |
| va-advisory-A2-1.simplest.xml | 6.97 | 127 | 4 | 214 | 7110 | 10440 | 1680 | 1026740 |

The aggregate numbers are as follows:

| file | avg depth | # elem | # ns | # lines | bytes no whitespace | bytes w/whitespace | bytes GZIPed | DOM load time (ns) |
|---|---|---|---|---|---|---|---|---|
| *.xml (original IWXXM 2.1 files) | 5.25 | 113.60 | 7 | 187 | 7296 | 10523 | 1800 | 1754927 |
| *.noOM.xml | 5.12 | 86.80 | 4 | 146 | 5411 | 7772 | 1375 | 1520266 |
| *.simplest.xml | 5.06 | 74.40 | 4 | 126 | 4795 | 6744 | 1240 | 1779889 |
| .noOM.xml rel to .xml | 97% | 76% | 57% | 78% | 74% | 74% | 76% | 87% |
| .simplest.xml rel to .xml | 96% | 65% | 57% | 67% | 66% | 64% | 69% | 101% |
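For readers who want to check the "rel to .xml" rows, the arithmetic is simply each variant's aggregate divided by the original's aggregate, rounded to a whole percent. The short Python sketch below reproduces three of the columns from the aggregate rows above; the dictionary keys and the `relative` helper are illustrative names, not part of the original utility.

```python
# Aggregate values copied from the table above (three columns only).
original = {"elements": 113.60, "lines": 187, "bytes_no_ws": 7296}
no_om = {"elements": 86.80, "lines": 146, "bytes_no_ws": 5411}
simplest = {"elements": 74.40, "lines": 126, "bytes_no_ws": 4795}


def relative(variant, baseline):
    """Express each variant metric as a whole-number percentage of the baseline."""
    return {key: round(100 * variant[key] / baseline[key]) for key in baseline}


no_om_pct = relative(no_om, original)
simplest_pct = relative(simplest, original)
print(no_om_pct)
print(simplest_pct)
```

The avg depth column is left out on purpose: recomputing it from the rounded aggregates shown here can differ by a percentage point from the reported figure, presumably because the reported figure was derived from unrounded values.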

Overall the main advantages of simplifying by removing O&M are in reduced dependencies/namespaces, reduced # of lines and elements, and improved file sizes. The "simplest" files are a further modest improvement over "noOM" on lines, elements, and file sizes. There was virtually no impact on average element depth for either simplification case.

iwxxm-simplification.zip

marqh commented 6 years ago

I have a particular concern regarding this change, that I would like to summarise on this ticket.

IWXXM 2.x makes use of the O&M structures within the encoding.

My organisation has concerns that making structural changes to the encoding between IWXXM 2.1.1 and IWXXM 3.0 could have a significant impact on work we are undertaking to provide encoding and decoding capabilities for IWXXM.

Given the uncertainty in time scales for IWXXM 3 and the plans for adoption of IWXXM 2.1.1 this poses problems for managing a transition from 2.1.1 to 3.0. We have work programmes that may be aiming to deliver 2.1.1 capabilities before 3.x is ready for adoption.

We know that IWXXM 3.0 will bring backwards-incompatible changes; some of these changes are easier to manage with consistent solutions for 2.1.1 and 3. For example, the adoption of UUIDs is a backwards-incompatible change, as 2.1.1 messages don't mandate it, but they can implement it, so developers may future-protect through early adoption.
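As a small illustration of that kind of early adoption, a producer could already generate UUID-based gml:id values for 2.1.1 messages. The Python sketch below assumes a `uuid.` prefix, which is an assumption of this example rather than something stated in this thread; some prefix is needed in any case because gml:id is an XML ID, and XML IDs cannot begin with a digit, while a bare UUID may.

```python
import uuid


def new_gml_id(prefix: str = "uuid.") -> str:
    """Return a gml:id candidate built from a random (version 4) UUID.

    The prefix is an assumed convention; it keeps the value a valid XML ID,
    since a bare UUID may start with a digit.
    """
    return f"{prefix}{uuid.uuid4()}"


gml_id = new_gml_id()
print(gml_id)
```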

In contrast, the level of disruption to encoding and decoding capabilities that this ticket proposes is much more difficult to manage. If adopted, it seems likely that different code implementations within encoding and decoding software will be required in order for software to handle both IWXXM 2.1.1 and IWXXM 3.x. This level of disruption could be quite challenging to manage and may hinder the adoption of IWXXM 3.x significantly.

For this level of disruption to be worthwhile, I think a really major improvement is required and that the benefit should be clearly demonstrated. On balance, I do not see that the change proposed here represents enough of an improvement to be worth the effort and disruption. I do not think that the benefit of reduced complexity within the messages delivers enough to the developer communities who will be working with IWXXM. I fear that the potential benefit from simplification could end up being dwarfed by the complications to version management that it would bring. In this case I think the cost is too high, compared to the potential reward. I think that the efforts of the task team will be better focussed on maintaining stability, introducing major changes to meet identified detail issues, with as limited disruption as possible.

I think that it is more worthwhile to stick with the O&M constructions and manage the consequences whilst keeping the level of change from 2.1.1 to 3.x to a manageable and limited extent.

I would appreciate further feedback on these thoughts, particularly from HMEI representation within this team, and also from member state representatives.

thank you mark

marqh commented 6 years ago

Following on from the task team discussions on this issue, I would like to request that task team members indicate their view, by 'voting' on this comment using the 'reaction' tab: top right corner in the menu bar at the top of the issue: https://github.com/wmo-im/iwxxm/issues/27#issue-261693504

Please use either +1 (in favour) or -1 (against) only

I encourage all members to consider voting one way or the other. This issue is in need of a specific resolution.

I will leave 2 weeks for members to consider and provide their votes. We will aim to have a task team meeting the week beginning 12th February to discuss this topic and how to interpret the comments and votes.

I encourage further considerations and comments to be added to this ticket by all interested parties.

thank you mark

mperoutka commented 6 years ago

I note two additional reasons not to simplify. One was shared via e-mail earlier. It seems wise to include it here for completeness.

Thanks! --Matt

blchoy commented 6 years ago

Let me rephrase the question to see if there are any new viewpoints. Correct me if I have stated anything wrong.

TACs were designed to provide end users with MET information with the least communication overhead. They are thus based on a collection of use cases that mostly involved the end users. This is also why they are rather inconvenient for others to use.

With the introduction of IWXXM we have the opportunity to make the formats cover more use cases (including interoperability), but that introduces complexity into the format with additional metadata and requirements (e.g. UUID for gml:id). But how far should we go?

From a scientific/WMO perspective, we would like the format to be as descriptive as possible (though most of the elements could be optional items). From an operational/ICAO perspective, however, we would like it to be as simple as possible to save processing and transmission effort. And the comments I have seen so far are roughly split according to people's backgrounds as above.

Looking back, using O&M in legacy products like OPMET since IWXXM 1 was probably going a bit too far, but if we go one step backwards now we may slow down the evolution of IWXXM, which IMO would ultimately involve the use of O&M, and Matt has already made the pros/cons clear. Finally, the use of a "presentation format" for rendering IWXXM messages will also make the discrepancies between the O&M and no O&M versions less discernible to end users (this leaves only the developers, implementers and operators to be bothered).

So the question(s) we should be asking: (1) Is O&M something we will ultimately be using to describe MET phenomena? (2) Do the simplified legacy products really justify the effort we are going to put in, taking into account the development of a "presentation format" and (1) above?

braeckel commented 6 years ago

This is likely to be the last opportunity for significant structural improvements to IWXXM for a number of years given the near-term operational implementations and investment. The structure we use in IWXXM 3 will be in use for some time; either smaller and simpler or more backwards compatible with IWXXM 2.

From a technical perspective there are a number of ways in which O&M does not match well with IWXXM requirements today, and matching O&M to future requirements is currently speculative given the uncertainty around the future ICAO system. O&M brings with it 3-4 dependent schemas and namespaces (O&M, O&M sampling, O&M spatial sampling, and in some cases METCE Process), conceptual complexity in the form of understanding the O&M standards and structures, many mandatory and optional elements that are effectively unused in IWXXM exchanges (om:type, om:observedProperty, om:metadata, om:relatedObservation, om:parameter, om:resultQuality, and others), report-level information that is duplicative and interlinked between OM_Observations (such as aerodromes and FIRs), and with some products mandatory om:validTime/om:resultTime/om:phenomenonTime constructs that are almost always duplicates (xlinks) or never used. O&M has valuable concepts and definitions (om:phenomenonTime, om:validTime, etc.), but the OM_Observation structure itself fits awkwardly and imperfectly into the information structure that needs to be exchanged. It should be noted that the O&M usage varies between different IWXXM products (i.e., VAA, METAR) based on the needs for each product.

Removing O&M would restructure time and other O&M information and is more significant in nature than UUID changes (for example) but every IWXXM change breaks compatibility with either producers or consumers if it includes nearly any type of schema modification. There are some types of schema changes that do not break backwards compatibility but they have been rare. I suspect the required software changes for removing O&M in IWXXM 3 are comparable to the combined scope of 3-4 of the other 35-40 issues/changes that will eventually be part of IWXXM 3, and I am therefore not as concerned about it significantly hindering adoption in the context of the full list of IWXXM 3 changes.

The unique beneficial uses of OM_Observation structures embedded in IWXXM messages are unclear to me at present. Will OpMET organizations search for O&M observations across multiple products (IWXXM and others)? Contents outside and between O&M elements (including issue times, VAAC/WMO/ATS ids, xlinked time references between observations and other shared report-level constructs, sequence numbers, remarks, bundling of OM_Observations into reports, etc.) are important to meaningfully parse an IWXXM message in nearly all cases and it is difficult to see each O&M structure standing alone or providing a significant aggregation mechanism across multiple products. Can an IWXXM METAR trend forecast OM_Observation alone be usefully combined with an OM_Observation structure from another schema/product? I simply don't see the advantages of OM_Observation itself unless ICAO someday determines that O&M is the exact requirement. In this case we will be working with quite different products built atop OM_Observation and we will enter new territory with regards to how content should best be structured, which are essentially new products with more far-reaching impacts than the removal of O&M.

I think the primary question is not whether O&M is useful (since there are a large number of issues and unclear benefits with O&M) but whether the improvements gained through simplification are worth breaking compatibility in this part of IWXXM. I would not support using O&M if we were starting from scratch but it is more debatable as to whether it should be removed now.

As summarized in the ".noOM.xml rel to .xml" line at the end of an earlier comment it appears to be quite feasible to reduce file sizes, # of elements per report, and # of lines by 25% across the board and drop 3 of the 7 standards/schemas/namespaces currently required by IWXXM by removing O&M, and that other simplifications can account for at least an additional 10%. These improvements and the lightened developer burden that comes with simplification are essentially what should be weighed against the value of backwards compatibility with the O&M portion of IWXXM 2.

While much of the simplification discussion has focused on O&M (including most of our discussion above) there are other simplification approaches that can be undertaken. If simplification is agreed upon we should then discuss which simplification approaches should be used. The differences between the .noOM.xml and .simplest.xml files, for example, are mainly changing some AIXM objects to simple IDs and flattening the contents of deep IWXXM types to reside on the om:result type (such as the contents of AerodromeSurfaceWind and AerodromeHorizontalVisibility on METARs). There are certainly other approaches that have not yet been considered.

blchoy commented 6 years ago

@sforeman: To continue the debate we need some WIGOS experts to shed light on the adequacy of using O&M and perhaps METCE too to represent MET phenomena, especially for exchange purposes. Who can we invite?

jkorosi commented 6 years ago

Personally I like the idea of simplifying IWXXM. I think simpler things are simpler to implement and to maintain. Therefore the opportunity to decrease the complexity of IWXXM should be considered really carefully. Unfortunately I am not familiar with the O&M standard and cannot judge whether it will be valuable in the future or not. On the other hand, at present IWXXM defines only aviation formats and therefore I suppose it will be used mainly by the aviation community in the next few years. They have to follow the ICAO Amendments. Based on the Compatibility of releases of IWXXM with requirements of Annex 3 to the Convention on International Air Navigation available at https://wiswiki.wmo.int/tiki-index.php?page=dataModelStatus, I believe that there is only one version of IWXXM operational for Amendment 77 (I know it is not authoritative; can you direct me to the authoritative document where it is/will be defined, please?). I guess that the same will happen with Amendment 78, where IWXXM 2.1.1 will be replaced by IWXXM 3.1. In other words IWXXM 2.1.1 will be used "operationally" only up to November 2018. After that date it should not be used by states, who should begin to use IWXXM 3.1.

If I am right then IWXXM 2.1 should continue to be used for "only" half a year from now, so we can afford to maintain support for IWXXM 2.1 for a limited time interval separately from our support for IWXXM 3.1 - again based on the assumption that sometime in 2019-2020 we will be able to remove support for IWXXM 2.1 from our codebase. Simplification of the schema will lead to faster decoding and less RAM use, when you realize that several thousands of reports may be decoded when plotting a METAR chart. Our implementation of IWXXM METAR decoders is currently 6.5 times slower than TAC decoders (4916 reports), for TAF it is 8 times slower (13240 reports). Of course there is room for optimization in our XML parsing mechanisms, but simplification of the schema itself might have a larger effect on efficiency than any optimisation efforts.

Also we believe that simplification of the schema will improve adoption of the format or general attitude of people towards the perceived complexity/redundancy of the format.

I believe that using O&M for representing measurements in IWXXM is an independent thing from O&M being used by METCE for support of WIGOS. Even though O&M will be used for WIGOS station identifiers, it does not prevent removal of O&M from other places where it is currently used in IWXXM.

I know that all this requires everybody to update their software systems to the new operational IWXXM version each time a new Amendment comes into effect. But it also secures interoperability where everyone will understand all reports. In other words, if my system is not updated and I receive a report encoded in a higher version, then I will not be able to understand it directly. This is similar to situations like when the TAF validity format changed in the past - any software systems that wanted to understand the new format needed to be upgraded.

It is quite apparent that most people are thinking about TAC METARs in the discussions. The METAR format does not change so much in Annex 3 amendments when compared to TAF or SIGMET. If everybody in the world really was decoding/processing/visualising not only METARs as it is now, but also TAC TAF and TAC SIGMET, then people would be equally forced to upgrade their systems with each Annex 3 Amendment if they wanted algorithms to understand reports from other states. But as it stands now most people just display the latest TAC TAF for a set of aerodromes - and you do not need to be able to decode the body of a TAC TAF to figure out which one is the latest. Similarly in TAC you do not need to be able to decode the full SIGMET body if you only want to include the current SIGMETs in flight briefing documents. This only becomes important for people that want to graphically visualise these reports on a map - and we think that is one of the goals we are reaching for with IWXXM, to be able to reliably decode and visualise the reports.

In our case we have to perform software upgrades for most of our aviation users with each Annex 3 amendment, because we deal with all the report types. But we understand that if some vendor is interested in producing/consuming METAR only (which as we know is much more "stable" compared to other reports) then the requirement to do software upgrades every 2-3 years seems superfluous. However even if you are a country with 800 METAR stations, as far as we understand the stations typically communicate with a central system that collects measurements, i.e. the METAR reports are not formatted by the station itself but by a central "DB"... But we could be mistaken, there are probably still many stations in the wild where METAR is coded directly at the station.

sforeman commented 6 years ago

The WIGOS metadata team is equally struggling with the applicability of O&M to what they are doing. Most people find it too abstract to think of a description of the conditions at a station to be an observation about the state of the station.

That said, the WIGOS work is only about the station metadata at present - we don't have any requests for exchanging (most types) of observation in O&M. The one exception is hydrological observations that are based on WaterML2 - that is based on O&M.

I think the issue for iwxxm is that we are trying to represent a set of business rules, rather than descriptions of the real world ("CAVOK" is the most obvious example of this). In the future "data centric SWIM" world, there will be a need to provide some metadata (process) information associated with the values exchanged that is at present implicit because the data are exchanged in one of the standard packages (metar, taf, etc). Once we deliver individual elements (eg temperature) to users who may combine them with other information, it will become more relevant to include with the value being exchanged that it was produced by the "taf" process (ie it is a forecast) or the "metar" process (ie it is an observation). I am using taf and metar as shorthand for the full set of conditions of the process (averaged over x minutes leading up to the observation time, etc).

braeckel commented 6 years ago

The PR was accepted and this issue can be closed