wmo-im / iwxxm

XML schema and Schematron for aviation weather data exchange
https://old.wmo.int/wiswiki/tiki-index.php%3Fpage=TT-AvXML

Rule COLLECT.MB1 in iwxxm.sch is overly restrictive #162

Closed blchoy closed 4 years ago

blchoy commented 5 years ago

Worapong Jirojkul of AEROTHAI has the following comment with regard to rule COLLECT.MB1 in iwxxm.sch:

Dear Mr. Choy,

Thank you again for your answers and piece of advice during the workshop in Bangkok last week :)

I have another issue regarding Schematron validation of collectives. My understanding is that a collective should carry the same type of OPMET information regardless of version. For example, we can have both v2.1.1 and v3.0 in a collective.

However, I have tried compiling a collective of mixed-version METARs, but it failed validation. So I looked at the rule 'COLLECT.MB1', which is:

`count(distinct-values(for $item in //collect:meteorologicalInformation/child::node() return(node-name($item)))) eq 1`

I found that node-name() returns a QName, which comprises a local name and a namespace URI, and that is the problem: different versions have different URIs even when the reports are of the same type. In addition, child::node() also returns text nodes, not just elements, so I have a problem here as well.

So I tried to tweak this a bit as follows:

`count(distinct-values(for $item in //collect:meteorologicalInformation/child::* return(local-name($item)))) eq 1`

(child::* returns only elements, and local-name() returns only the element name.) Now my mixed-version collective successfully passes validation. I am not sure this is correct for all cases, so please kindly check this issue.
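The QName/local-name distinction described above can be illustrated with a short Python sketch. The IWXXM and COLLECT namespace URIs follow the real ones, but the document is heavily simplified (empty METAR elements standing in for full reports):

```python
import xml.etree.ElementTree as ET

# Minimal mixed-version collective, for illustration only.
doc = """<collect:MeteorologicalBulletin
    xmlns:collect="http://def.wmo.int/collect/2014">
  <collect:meteorologicalInformation>
    <METAR xmlns="http://icao.int/iwxxm/2.1"/>
  </collect:meteorologicalInformation>
  <collect:meteorologicalInformation>
    <METAR xmlns="http://icao.int/iwxxm/3.0"/>
  </collect:meteorologicalInformation>
</collect:MeteorologicalBulletin>"""

root = ET.fromstring(doc)
# Element children of each member, mirroring child::* (no text nodes).
children = [child for member in root for child in member]

# ElementTree tags are Clark-notation QNames, "{namespace-uri}local-name",
# which is what node-name() effectively compares in the original rule.
qnames = {child.tag for child in children}
local_names = {child.tag.split("}")[-1] for child in children}

print(len(qnames), len(local_names))  # 2 1
```

With the original rule the two METAR versions count as two distinct QNames, so the collective fails; comparing only local names, as the patch does, yields a single value.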

Please find attached two sample collectives I used for the test, one with mixed versions of METAR and another with mixed METAR and TAF.

Test Files.zip

Best regards, Worapong AEROTHAI

I think this is a reasonable suggestion. However, we may only want to amend the rule in iwxxm.sch but not the one in the original collect.sch, as aggregating feature instances with the same element name but different namespaces (which correspond to different IWXXM versions) may be unique to IWXXM.

blchoy commented 5 years ago

A deeper investigation revealed that this is not as simple as I thought. In fact, iwxxm.sch is version specific and can only be used with the IWXXM version it is designed for (e.g. Version 3.0). How, then, can it be used to validate a collective containing multiple versions of an IWXXM report? It seems we would need a separate Schematron file to validate collectives. [Update: probably not. Looking at iwxxm.sch again, it will only check those parts in COLLECT 1.2 and IWXXM 3.0, because the rule contexts are constrained to their respective namespaces. To validate the parts in other versions of IWXXM, it can be as simple as running iwxxm.sch of all the other IWXXM versions. As a result, AEROTHAI's suggested change can still be applied.]

mgoberfield commented 5 years ago

Aren't member states supposed to validate their IWXXM messages with the appropriate version of iwxxm.sch before sending?

If a Schematron file can't handle different versions of IWXXM in a collective, then a RODB will have to "disassemble" those collectives that contain different IWXXM messages and validate them separately.

It is not really clear to me how these mixed-version collectives come about, or how many times these messages have to be validated.

blchoy commented 5 years ago

It all started when people were looking into the need for RODBs to aggregate IWXXM METAR or TAF reports that are in different IWXXM versions. There are situations where the schemas of some report types do not change from one version to another (e.g. when a new report type is introduced in the new version only). As Annex 3 does not mandate which version of IWXXM is to be used with an amendment, all compliant versions of the schemas can be used, hence the possible occurrence of mixed-version collectives.

As we are telling people that IWXXM 3.0 is the only version that is Amendment 78 compliant, there should be no mixed-version collectives for the time being. But when the next version comes, which is likely to be the one that only adds the new WAFC SIGWX objects, we should be able to handle this.

blchoy commented 5 years ago

After doing some tests, the following summarizes what we could do about this issue:

  1. Apply the patch to COLLECT.MB1 in iwxxm.sch as suggested by Worapong Jirojkul. This should allow validation of collectives with multiple versions of an IWXXM report type AND the associated current-version IWXXM reports.
  2. To validate the remaining IWXXM reports of previous version(s), we can either:
     a. patch COLLECT.MB1 in iwxxm.sch of previous versions of IWXXM to allow them to do the same as in (1). In this case, the user will need to run the validator multiple times, with iwxxm.sch from each version of IWXXM; or
     b. create a new iwxxm+collect.sch which includes COLLECT.MB1 and all rules from the current and previous versions of IWXXM, each identified by their respective namespaces. This could be a tedious job but can be automated. In this case the validator only needs to be run once, with iwxxm+collect.sch. (Note: here we assume the codelist RDFs are "backward compatible". Further discussion on how to ensure this may be required.)

In a separate discussion with @marqh, we may do nothing right now, as we are not encouraging people to use versions prior to IWXXM 3.0; the need to support multi-version collectives can therefore be deferred until IWXXM 3.1.

Views please?

jkorosi commented 5 years ago

Hi all,

we had quite a long discussion in IBL about this topic. The intention to collect different versions of IWXXM reports is related to the question of whether all IWXXM users are able to update their software in time, or even need to update it at all. It is obvious that updating to the latest IWXXM 3.0 version is not necessary when I am using, for example, only METARs, which are the same as in IWXXM 2.1.1 (let's suppose so). But if I don't update my system to IWXXM 3.0, then I will not understand those reports. Moreover, I will not be able to validate them, because I don't have the latest validation rules. In other words, if we expect several IWXXM versions to be used in the real world in parallel, then we expect that not all institutions are able to update their systems in time. Otherwise they would use the latest version.

So back to the original problem. There are three things we can do:

  1. Do not allow mixing of different IWXXM reports. Consequences:
     1.1. Users will get several collections, one per IWXXM version.
     1.2. Each collection will have a unique WMO heading.
     1.3. Users will be able to easily pick the collections supported by their systems, so they will not need to update to the latest IWXXM version immediately.
     1.4. Nothing has to be changed in the current IWXXM design.
  2. We support the idea of Mark Oberfield to disassemble the IWXXM collection into individual report documents before validation, for example by first loading the entire collection into an XML DOM and creating several smaller DOM documents from the individual METAR/TAF "reports" stored in the collection. The programmer would then check the IWXXM version used by each individual METAR or TAF (e.g. xmlns:iwxxm="http://icao.int/iwxxm/3.0") and run the respective iwxxm.sch from 2.0, 2.1 or 3.0 on the smaller document containing only one "report". A big advantage of this approach is efficiency: it does not require running Schematron validation with multiple IWXXM versions over the entire collection. Also, it does not require modifying any existing iwxxm.sch!
  3. The upcoming Amendment will specify the exact operational IWXXM version for each report type. We discussed this option at the Hong Kong meeting. The result of this discussion is the "Validity of Schemas" table https://wiswiki.wmo.int/tiki-download_item_attachment.php?page=TT-AvXML-6&attId=267. It was transformed to https://wiswiki.wmo.int/tiki-index.php?page=dataModelStatus. If ICAO restricts the version for each report to only one, then these troubles should be avoided. I think that the lack of any rules on operational versions is a potential source of trouble.
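Option 2 above (disassemble the collection, then pick the Schematron file per report) can be sketched in Python with only the standard library. The Schematron file paths and the minimal collective document here are hypothetical placeholders; the namespace URIs follow the pattern quoted above:

```python
import xml.etree.ElementTree as ET

COLLECT_NS = "{http://def.wmo.int/collect/2014}"

# Hypothetical mapping from report namespace to the matching iwxxm.sch
# (file locations are placeholders, not real distribution paths).
SCHEMATRON_BY_NS = {
    "http://icao.int/iwxxm/2.1": "iwxxm-2.1/iwxxm.sch",
    "http://icao.int/iwxxm/3.0": "iwxxm-3.0/iwxxm.sch",
}

def split_collective(xml_text):
    """Yield (report, schematron) pairs: one standalone report per member."""
    root = ET.fromstring(xml_text)
    for member in root.iter(COLLECT_NS + "meteorologicalInformation"):
        for report in member:  # element children of the member
            # Strip Clark notation "{uri}local" down to the namespace URI.
            ns = report.tag[1:].split("}", 1)[0]
            yield report, SCHEMATRON_BY_NS.get(ns)

collective = """<collect:MeteorologicalBulletin
    xmlns:collect="http://def.wmo.int/collect/2014">
  <collect:meteorologicalInformation>
    <METAR xmlns="http://icao.int/iwxxm/2.1"/>
  </collect:meteorologicalInformation>
  <collect:meteorologicalInformation>
    <METAR xmlns="http://icao.int/iwxxm/3.0"/>
  </collect:meteorologicalInformation>
</collect:MeteorologicalBulletin>"""

pairs = list(split_collective(collective))
for report, schematron in pairs:
    # Each report would now be serialized and validated on its own.
    print(ET.tostring(report, encoding="unicode"), "->", schematron)
```

Each yielded report would then be serialized to its own document and validated with the matching iwxxm.sch, so no single rule file ever has to cover more than one IWXXM version.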

I know I went beyond our original topic, but I think now is the right time.

borisburger commented 5 years ago

It seems to us that collecting reports in one large XML brings quite a few complications but not many obvious benefits. We understand it is likely too late to change anything now, but we wonder whether collecting reports as multiple XML files in a ZIP archive would be an easier solution (much as a Microsoft Word DOCX is a ZIP collection of multiple XML documents).

  1. If ZIP collections were created, no modifications to the individual report XMLs would be necessary when creating the "bulletin" (this relates to the discussion on modifying source reports, where namespace declarations should be placed, and whether they can be moved around in collections).
  2. From a programming point of view it is a more efficient solution, because it does not require creating an XML DOM for the entire collection in memory in order to create the bulletin; you can just take the existing report XMLs and ZIP them.
  3. Users could validate and decode the report XMLs one by one.
  4. In ZIP archives, files are stored in the order in which they were added (report order would be preserved).
  5. In the current implementation collect:bulletinIdentifier does not contain anything beyond what a WMO heading already includes. If more information pertaining to the whole bulletin had to be stored, it could be placed in a separate file in the ZIP (e.g. bulletin-info.xml).
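As a rough sketch of the ZIP idea, using Python's standard zipfile module: the file names, report payloads and bulletin-info.xml content below are all made-up placeholders, but the sketch shows that insertion order is preserved, as point 4 requires.

```python
import io
import zipfile

# Hypothetical per-report files and bulletin metadata (placeholders only).
reports = {
    "LZIB-metar.xml": b"<METAR xmlns='http://icao.int/iwxxm/3.0'/>",
    "LZKZ-metar.xml": b"<METAR xmlns='http://icao.int/iwxxm/2.1'/>",
}
bulletin_info = b"<bulletinInfo>whole-bulletin metadata goes here</bulletinInfo>"

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    for name, xml in reports.items():  # dict preserves insertion order
        zf.writestr(name, xml)
    zf.writestr("bulletin-info.xml", bulletin_info)

with zipfile.ZipFile(buf) as archive:
    # Report order is preserved; the bulletin metadata file comes last.
    print(archive.namelist())
```

A consumer could then validate and decode each member file one by one, without ever building a DOM for the whole bulletin.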

blchoy commented 5 years ago

Hi @borisburger,

I think there is still some value at this point in having XML collectives of one or multiple IWXXM reports (equivalent to a TAC bulletin), as one can extract content from them directly without any pre-processing. Of course, from a software engineer's perspective, one can always create a filter which reads the ZIP file and exposes the content in exactly the same way as the XML collectives. My personal preference is to do less right now, as the need for collectives will eventually go away with the introduction of SWIM (and WMO is also thinking of phasing out the WMO Abbreviated Header Line). I agree that if collectives are used much longer than we expect, the problem will grow over time (i.e. the combined Schematron rule file that validates a collective with multiple versions of IWXXM reports will grow with the number of IWXXM versions) and ZIP may be a better solution.

While we mentioned that collectives will become non-essential in SWIM, we have never analysed how publish/subscribe and WFS can lift the need for them. Maybe you could shed some light on this so that we can make some proposals for the MET-SWIM plan?

Regards, Choy

borisburger commented 5 years ago

Hi @blchoy, in WFS, when a user requests data for a list of aerodromes, or for a geographic area defined by latitude-longitude bounds, the WFS response will encode multiple "reports" in a wfs:FeatureCollection, which is a sequence of GML members. There is a more fundamental difference between GTS and WFS collections, though:

  1. WFS is on-request: you control which information you want, so the collection in the response only has data you are actually interested in.
  2. GTS collections are prepared by "someone else" (a state collecting data for its own stations, or a larger communications centre compiling bulletins for a wider region, with 50 METARs in one bulletin).

> one can extract content from them directly without any pre-processing

In the traditional GTS world the collections will inevitably contain a portion of reports you do not necessarily care about. A real-world use case is that users want to see data for one particular aerodrome (or several specific aerodromes). If it is "buried" in a collection of 50 unrelated METARs, the software needs to extract the data for the aerodrome the user wants from the large XML. Loading large XML DOMs into typical parsers is slow and memory-consuming. That is why I do not understand you saying "without any pre-processing".

From a software engineer's perspective it is more efficient to split IWXXM collections into individual per-report documents when storing them into any sort of "database". And when you split the collection, you can also validate each such sub-document separately, alleviating the need for a combined iwxxm.sch that would accumulate all IWXXM versions over time.

Regards, Boris

borisburger commented 5 years ago

I am not sure how WFS or publish/subscribe can remove the need for traditional GTS-like collections. Each has its strengths and weaknesses:

  1. Traditional GTS-like store-and-forward based on static routing by WMO heading filters is well standardised. For many datasets it is easy to tell from the WMO heading what data it represents and to set up routing based on that information. It does not work so well for NWP or satellite data. The best-known weakness is that with store-and-forward you ultimately get more data than you need.
  2. Publish/subscribe mechanisms allow better granularity, letting you subscribe to just the data subsets you need. But at the same time the current implementations of the catalogues are not easy to use, and there is not enough standardisation in how people catalogue their data... Right now the subscription model of WMO WIS does not enjoy widespread adoption.
  3. Web services like OpenGIS WMS, WCS and WFS are great for application developers and open new ways of accessing data (for people who need to query specific datasets). But they are not optimal for data redistribution or for weather monitoring (where a more traditional streaming-data approach is more helpful). For observations or forecasts, OpenGIS WFS does not define how the individual data "reports" should be encoded. IWXXM fills this void for aviation meteorology, but the rest of meteorology is up in the air. Also, while web services are good for application developers, there still needs to be some form of data exchange between the centres/organisations that will be providing those web services to end users/applications.

I think there is no silver bullet here. Maybe publish/subscribe will replace traditional store/forward when WIS matures. Web services are great for specific use cases, but for widespread adoption more standardisation and maturing is due. In the transition period it is probably best to work with all the currently used approaches.

efucile commented 5 years ago

Hi @borisburger, thanks for your comments. There are teams working fast on the implementation of pub/sub mechanisms, and I am going to bring this discussion to their attention. I agree that there is no silver bullet. However, one of the principles of WIS 2.0 is that we want to modify the GTS in such a way that collections are not needed, but I think complex work on the catalogue needs to be done before we can provide effective access to data streams. Granularity is always a problem of balance: where do we stop in detailing the data?

blchoy commented 5 years ago

Thanks @borisburger for your views. I am sure there is still a lot of design work to do to move forward to SWIM, especially when different stakeholders have different use cases and expectations (e.g. information producers may want to package their data for dissemination, database owners may want to scrutinize data and make it servable with the least effort, and end users may want easy access to the information they need). I am trying to address these with the new SIGWX objects for the WAFCs' SIGWX chart and I will definitely involve the team when things are more mature.

But going back to our original issue, could @efucile shed some light on my suggested moves?

efucile commented 5 years ago

First of all I have to remind everyone of the current status of IWXXM 3.0. We have submitted the new version for consultation to the national focal points (NFPs) on codes matters. They have until 16 August to send comments and requests for changes (none received yet). The only way to make a change now is to have the problem officially reported by an NFP, as the ball is in their court now. After that, no other opportunity will open until the next fast track in November, for approval in May 2020. If an NFP reports such a complex problem, it will force us to respond and delay the implementation date, which is now 7 December 2019. This would mostly affect the space weather component, which is the new part, and we have a requirement to have it operational by the end of the year.

I think we should avoid this path. If you don't agree, please comment; otherwise I will assume we are not making any change for now.

Going into the specifics of the problem: thanks to @blchoy and @jkorosi for your proposals. We need to decide the way forward. My considerations, to facilitate the decision:

  1. My opinion is that collections are a GTS communication artefact that won't survive for many years.
  2. In the processing systems I know, the first thing we do is split the collections (or collectives) before doing anything else, and the idea of having different subsets in BUFR to emulate collectives has given more problems than benefits.
  3. I am in favor of giving a precise methodology to be adopted in validation and sticking with it for the future.

With this in mind we could adopt solution 1 from @jkorosi, but I doubt that everyone will fully comply. My experience is that a mandated rule like "don't mix different versions of IWXXM in a collection" will be disregarded in many situations and open the field to a number of incidents. In the end your software will need to take into consideration the case of a collection that "accidentally" contains different versions of IWXXM. Therefore I don't think we have any option left other than 2 from @jkorosi.

Conclusion. I think we should clearly communicate to the users that collections are a communication artefact and validation of messages in a collection has to be performed individually after splitting the collection in single messages. Given our limited resources, I think we should spend our energy thinking about how to get rid of collections rather than improving them.

blchoy commented 5 years ago

> I think we should clearly communicate to the users that collections are a communication artefact and validation of messages in a collection has to be performed individually after splitting the collection in single messages.

I concur with this conclusion. As we are not asking people to run iwxxm.sch on collectives, the situation where COLLECT.MB1 in iwxxm.sch complains about different versions of IWXXM reports in a collective should not occur. In the next version of IWXXM (e.g. version 3.1) we can remove the redundant COLLECT.MB1 from iwxxm.sch.

marqh commented 5 years ago

At TT-AvXML-8 the team decided that this is a useful restriction for IWXXM 3.0 and shall now be considered for IWXXM 3.1

blchoy commented 4 years ago

While we will confirm WG-MIE's understanding that the only IWXXM version to be used since November 2019 is 3.0.0, we may want to consider telling them what will happen when we start to have reports whose structure does not change in IWXXM 3.1, whether or not we want to keep COLLECT.MB1 to check the integrity of a collective. The following are the technical details:

  1. Include a modified version of COLLECT.MB1 that allows multiple versions of a report (or remove it completely) in iwxxm.sch of IWXXM 3.1.0.
  2. Modify COLLECT.MB1 to allow multiple versions of a report (or remove it completely) in iwxxm.sch of IWXXM 3.0.0, and publish it as IWXXM 3.0.1.
  3. Tell users that during validation one will need to run iwxxm.sch of all IWXXM versions appearing in the collective.
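For point 3, a small helper can discover which IWXXM versions occur in a collective, and hence which versions' iwxxm.sch files need to be run. This is an illustrative sketch, not project code; it assumes the version number is embedded in the namespace URI, as in http://icao.int/iwxxm/3.0, and uses a simplified collective document:

```python
import re
import xml.etree.ElementTree as ET

# Assumed IWXXM namespace pattern with the version as the final segment.
IWXXM_NS = re.compile(r"^http://icao\.int/iwxxm/(\d+\.\d+(?:\.\d+)?)$")

def iwxxm_versions(xml_text):
    """Return the set of IWXXM versions whose namespaces occur in a document."""
    versions = set()
    for elem in ET.fromstring(xml_text).iter():
        if isinstance(elem.tag, str) and elem.tag.startswith("{"):
            ns = elem.tag[1:].split("}", 1)[0]
            match = IWXXM_NS.match(ns)
            if match:
                versions.add(match.group(1))
    return versions

collective = """<collect:MeteorologicalBulletin
    xmlns:collect="http://def.wmo.int/collect/2014">
  <collect:meteorologicalInformation>
    <METAR xmlns="http://icao.int/iwxxm/2.1"/>
  </collect:meteorologicalInformation>
  <collect:meteorologicalInformation>
    <METAR xmlns="http://icao.int/iwxxm/3.0"/>
  </collect:meteorologicalInformation>
</collect:MeteorologicalBulletin>"""

print(sorted(iwxxm_versions(collective)))  # ['2.1', '3.0']
```

A validator driver would then run iwxxm.sch from each listed version over the collective (or over the split reports, per the earlier conclusion).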

blchoy commented 4 years ago

This was fixed in PR #200.