wmo-im / wis2box

WIS2 in a box is a reference implementation of a WMO WIS2 Node
https://docs.wis2box.wis.wmo.int
Apache License 2.0

add data pipeline for hydrology data #703

Open tomkralidis opened 5 months ago

tomkralidis commented 5 months ago

Add pipeline(s) to:

Notes:

ksonda commented 5 months ago

Just talked to @dblodgett-usgs. We should consider covJSON w/ waterml2 use case elements.

tomkralidis commented 5 months ago

Thanks @ksonda. CoverageJSON is default output from pygeoapi EDR support, so we would get it for free once there is an EDR plugin for a relevant backend.

ksonda commented 5 months ago

agree, sounds like the least cost path forward to me...

dblodgett-usgs commented 5 months ago

A couple thoughts about this suggestion.

There are two use cases here -- 1) the "Web data" use case which requires a convention to encode key elements of data for plots and some site metadata and 2) the "data exchange" use case which requires a convention to encode more precise data contents that are unique to the hydrometric station timeseries use cases supported by WaterML2 part 1.

IMHO, it would be best to just use CoverageJSON for the timeseries payload and a GeoJSON-compatible json-schema for the site metadata. If there are critical metadata nuances that can not be captured in a satisfactory way in CoverageJSON, then perhaps we jump into a full json encoding of timeseriesML/WaterML2 Part 1.
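A minimal sketch of that pairing, assuming one station and one parameter. The coordinates, site name, and the `discharge` key are made up for illustration, and a real CoverageJSON document would also carry a `referencing` block for the coordinate systems, which is left out here for brevity.

```python
def make_site_feature(site_id, name, lon, lat):
    """GeoJSON Feature carrying the site metadata (property names are hypothetical)."""
    return {
        "type": "Feature",
        "id": site_id,
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        "properties": {"name": name},
    }


def make_point_series(lon, lat, times, values, param_key, unit_label):
    """Minimal CoverageJSON PointSeries coverage for a single parameter."""
    if len(times) != len(values):
        raise ValueError("times and values must be the same length")
    return {
        "type": "Coverage",
        "domain": {
            "type": "Domain",
            "domainType": "PointSeries",
            "axes": {
                "x": {"values": [lon]},
                "y": {"values": [lat]},
                "t": {"values": list(times)},
            },
        },
        "parameters": {
            param_key: {
                "type": "Parameter",
                "observedProperty": {"label": {"en": param_key}},
                "unit": {"label": {"en": unit_label}},
            }
        },
        "ranges": {
            param_key: {
                "type": "NdArray",
                "dataType": "float",
                "axisNames": ["t"],
                "shape": [len(values)],
                "values": list(values),
            }
        },
    }


# Hypothetical station and readings, for illustration only.
site = make_site_feature("USGS-02085000", "example site", -79.0, 36.0)
series = make_point_series(
    -79.0, 36.0,
    ["2024-01-01T00:00:00Z", "2024-01-01T01:00:00Z"],
    [1.2, 1.3],
    "discharge", "m3/s",
)
```

The two documents stay separate here, so the site metadata can follow whatever GeoJSON-compatible schema is agreed on without touching the time series payload.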

I'd be happy to contribute to this effort as it unfolds and really appreciate your efforts on this!!

ksonda commented 5 months ago

Curious if there's room in the EDR spec to cover both use cases, given that /locations is just supposed to be a GeoJSON endpoint of some kind with the schema defined in the OpenAPI doc.

dblodgett-usgs commented 5 months ago

Probably yes -- for the more complex use case, the WaterML2 Part 1 metadata for time value pair metadata and the ability to alter default per time step metadata is the part that is going to be complicated and EDR has no issue with additional media types from the .../locations end point. Same for .../items, additional media types for features are well supported.

ksonda commented 5 months ago

Hmm, that is tricky. Could specify for each parameter a <parameter>_metadata whose associated range could be a nested array of metadata elements, but I think that breaks other covJSON use cases about slicing and such that assume unidimensional arrays. Also no obvious query mechanism via the EDR spec.

Alternative 1: best practice specify for each parameter a <parameter>_metadata whose range is an array of URIs pointing to some other EDR item, which is a nested array of "tvp" metadata elements. But then you can't select specific metadata elements.

Alternative 2: best practice specify for each parameter, a <parameter> metadata <metadata-element> that carries its own range? Clunky, but selectable via the parameter query parameter.

dblodgett-usgs commented 5 months ago

I need to go back and read the spec and think about it some. As an initial take, just doing the happy path CoverageJSON with as much of the WaterML2 spec as "just works" would be a really great step!!

ksonda commented 5 months ago

Straightforward:

Seems hacky but does in fact have relevant guidance in the spec:

Unclear:

EDR can maybe handle via /locations, but custom handling by the service is one thing and cross-protocol data exchange is another :(

To force in covJSON options

  1. Ignore
  2. Each Coverage is one time series for one station. Each parameter is actually a unique combination of observedproperty and method. Add the method object as a custom field in the parameter and add station metadata fields as custom fields at the Coverage level. I think this is closest to the waterml2 xml structure, but I'm not sure if it would break things with vanilla covJSON clients.
  3. Each parameter is a unique combination of station-observedproperty-method, and every station and method metadata element is a custom field in the parameter. Use ParameterGroups liberally to group these parameters by observedproperty.

unep-gwdc commented 5 months ago

That's a quite different approach to what was discussed last week during the HDWG meeting between the colleagues from the WQ IE @sgrellet, @KathiSchleidt, @hylkevds and Rob Atkinson to move towards an update of TSML with a hydro profile/extension and JSON encoding.

I agree with @dblodgett-usgs that we need a fair bit of metadata for the data exchange use case to make it work with WHOS. For station & measurement metadata, WIS relies on WIGOS OSCAR/Surface metadata, but this is fairly complex XML and hardly implemented by the hydro community so far.

I think this needs a more in-depth discussion within the HDWG, and maybe beyond, as this is also relevant for other domains.

ksonda commented 5 months ago

I think more discussion is good, but as part of that discussion I think it is worth seeing what is reusable or adaptable from straight geojson and covjson rather than assuming a priori we must have an entirely custom new json format from first principles. That may end up being the case, it may not. Above was just getting a start on how covJSON could fit in on the record.

dblodgett-usgs commented 5 months ago

I regret that I was not able to take part in the discussion at the HDWG meeting -- family vacation took precedence.

I fully expect that there is a need to do both. The Web use case could be (kind of has to be) satisfied by geojson and covjson because a boutique format won't be broadly supported / adoptable for Webby use cases.

There may be a world where a JSON encoding of TSML with a WaterML2 Part 1 profile or best practice would be a critical format for data exchange but it would need to be in addition to more accessible Web formats.

There also may be a world where we could establish a convention that would "just work" as geojson and covjson but I have a hard time seeing the compromises necessary for such a convention being acceptable to either Web or data exchange use cases. This is why I make the assertion up front that we probably need to do both at some level.

So, let's run with use of existing accessible formats and focus on Web use cases with as much data exchange content as fits easily?

ksonda commented 5 months ago

Something we've been batting about as an experiment to reveal the opportunities and limitations of the existing constellation of standards for the "webby" use case.

  1. Target a best practice doc for STA that allows a specific STA query to be proxied by EDR in a manner that allows there to be a covJSON output format that delivers the information that the community would want to see in a hydro profile of TSML.

  2. Define a best practice in covJSON for the packaging of station metadata with their time series.

Why?

  1. The webby use case for a time series JSON data packet implies a strong preference for at least a station name or id to go along with time series in the same document
  2. Between the WQIE and Hydroserver2, STA is positioned to gain prominence in the hydro community, while EDR is gaining prominence in the Met community. An STA -> EDR mapping has been under discussion before and this would be a way to move that conversation forward at the same time
  3. EDR gives covJSON

webb-ben commented 4 months ago

Related to https://github.com/wmo-im/tt-w4h/issues/28

tomkralidis commented 3 months ago

@webb-ben and I met/discussed this today. Proposed way forward:

dblodgett-usgs commented 3 months ago

If this all pans out, it will be a very positive step. Thanks for taking it on guys!

webb-ben commented 3 months ago

covJSON has the important concepts in Waterml2 other than TVP metadata, which is of arguable importance to most people and can be covered by additional parameters if necessary. SensorThingsAPI has all the concepts in waterml2 including TVP metadata, so it would be very simple to write a simplified rewrap of STA JSON for a “complete” profile.

sgrellet commented 3 months ago

SensorThingsAPI has all the concepts in waterml2 including TVP metadata

Agreed, we partly prototyped this during the OGC WaterQuality IE. Following this, we had discussions within TT-WIS2 for Hydrology (with @unep-gwdc and washington).

We just shot an email to both OGC hydrodwg and tsml swg about how we could push this aspect forward. Feel free to contribute/raise interest.

ksonda commented 2 months ago

One idea for TVP metadata in a covJSON context is a best practice that says, 1 coverage = 1 station with timeseries for 1 parameter. Any additional parameters shall be TVP metadata fields for that timeseries (e.g. data status or quality codes)
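A rough sketch of what that best practice could look like, assuming the quality code travels as a second string-valued range on the same `t` axis. The `<parameter>_quality` naming is purely hypothetical, not something any spec defines, and `referencing` is again omitted for brevity.

```python
def coverage_with_tvp_metadata(lon, lat, times, values, quality_codes,
                               param_key="discharge"):
    """One coverage = one station, one observed parameter, plus a per-step quality code."""
    if not (len(times) == len(values) == len(quality_codes)):
        raise ValueError("times, values and quality_codes must be the same length")
    qc_key = f"{param_key}_quality"  # hypothetical naming convention

    def nd_array(data_type, data):
        # Every range shares the single time axis of the PointSeries domain.
        return {"type": "NdArray", "dataType": data_type,
                "axisNames": ["t"], "shape": [len(data)], "values": list(data)}

    return {
        "type": "Coverage",
        "domain": {
            "type": "Domain",
            "domainType": "PointSeries",
            "axes": {"x": {"values": [lon]}, "y": {"values": [lat]},
                     "t": {"values": list(times)}},
        },
        "parameters": {
            param_key: {"type": "Parameter",
                        "observedProperty": {"label": {"en": param_key}}},
            qc_key: {"type": "Parameter",
                     "observedProperty": {"label": {"en": f"{param_key} quality code"}}},
        },
        "ranges": {
            param_key: nd_array("float", values),
            qc_key: nd_array("string", quality_codes),
        },
    }


def time_value_pairs(coverage, param_key="discharge"):
    """Zip the ranges back into (time, value, quality) rows, TVP style."""
    qc_key = f"{param_key}_quality"
    return list(zip(coverage["domain"]["axes"]["t"]["values"],
                    coverage["ranges"][param_key]["values"],
                    coverage["ranges"][qc_key]["values"]))
```

A vanilla covJSON client would simply see two parameters; only a client that knows the convention would treat the second one as metadata about the first.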

KathiSchleidt commented 2 months ago

Sorry for chiming in late here, but REALLY needed some vacation!

Continuing on @dblodgett-usgs UC differentiation into 1) "Web data": Simple, Sweet, Stupid :) All you need are location and values for ObsProps. 2) "Data exchange": Full gory glory ;) All the details you need to vet the data.

We've realized for quite a while under OMS/STA that most real-world UC split into these 2 views, first you look at the details, see if the data is fit for purpose. Once you've done that, you rarely look at this detailed view again, just want the simplified Geometry and a number view. We've been chewing on such simplified result formats for STA, have done a CSV result format for such purposes, could see using CovJSON here (or proxying the STA data via EDR with CovJSON output). Trick will be providing backlinks to the full "Data exchange" view for the case that folks want to go back to the details.

As Sylvain has mentioned, in the WQ IE we've shown how well STA works for the "Data Exchange" UC, I'd be all for doing a WaterML profile for STA, defining what attributes should be in the properties blocks of the various STA classes.

On the webby view, while I still need to take a closer look at CovJSON (my brain still thinks in the various CIS encodings), I'd like to explore providing more than one time series for one station. To my memory, CovJSON does support non-spatiotemporal dimensions - couldn't one set up one dimension for stations, one for ObsProp, one for time?

On TSML: had good discussions with Paul Hershberg during my vacation (if I'm in DC anyway, too good an opportunity to waste!). Plan is for me to co-chair TSML, build on the preliminary work Paul and I did this spring, and try to get at least the conceptual model for TSML done by the end of the year (need to align to the updates in both OMS and OGC Coverage models). That can then be integrated into the WaterML Update while we work on the encodings. Btw - has anybody ever bothered to align CovJSON with OGC Coverage? Think it could be done fairly easily, would make the step from conceptual to encoding that much easier!!!
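To make the dimension idea concrete, here is a small sketch of how a multi-axis range could be indexed, assuming a CovJSON-style NdArray whose flat `values` list is stored in row-major order of `axisNames`. The `station` axis name is hypothetical; whether such an axis is acceptable practice is exactly the open question.

```python
def ndarray_value(nd, **index):
    """Look up one value in a CoverageJSON-style NdArray by per-axis index (row-major)."""
    offset = 0
    for name, size in zip(nd["axisNames"], nd["shape"]):
        i = index[name]
        if not 0 <= i < size:
            raise IndexError(f"index {i} out of range for axis {name!r}")
        # Row-major: the last axis varies fastest.
        offset = offset * size + i
    return nd["values"][offset]


# Two stations, three time steps, stored as one flat list (made-up numbers).
discharge = {
    "type": "NdArray",
    "dataType": "float",
    "axisNames": ["station", "t"],
    "shape": [2, 3],
    "values": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
}
first_of_second_station = ndarray_value(discharge, station=1, t=0)
```

One consequence of this layout is that every station shares the same time axis, so stations reporting at different times would need null padding or separate coverages.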

As a first step, I'd really appreciate samples of different water time-series encodings, see how reality aligns to the timeseries options available under TSML

ksonda commented 2 months ago

In practical terms, I think an easy-to-use software stack that provides WebUI <-"webby" covJSON <- EDR <- STA -> EDR -> Data "Exchange" TSML JSON would be a nice "house" to aim for.

That being said, I do think some kind of simplified proxy layer like EDR is necessary over STA for something like TSML because OData is too flexible to provide predictable outputs IMHO. STA-based clients would have to hardcode a specific query like https://labs.waterdata.usgs.gov/sta/v1.1/Things?$filter=name%20eq%20%27USGS-02085000%27%20or%20name%20eq%20%27USGS-0209734440%27&$expand=Locations,Datastreams($expand=ObservedProperty,Sensor,Observations($top=30)) to get the info roughly equivalent to WaterML2/TSML, or an ?f=tsmlJSON or whatever on STA would only work on a limited set of query patterns.

@KathiSchleidt re covJSON for more than one obsprop per station. I think it's worth exploring. Outstanding issues from my perspective:
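For illustration, this is roughly what "hardcoding" that query means on the client side. The helper name is made up; the query shape is the one in the URL above, and only the `$filter` value is percent-encoded so the output matches it.

```python
from urllib.parse import quote


def build_sta_things_url(base_url, station_names, top=30):
    """Build the fixed Things query that pulls Locations, Datastreams and Observations."""
    # OData string literals escape a single quote by doubling it.
    literals = [name.replace("'", "''") for name in station_names]
    name_filter = " or ".join(f"name eq '{lit}'" for lit in literals)
    expand = (
        "Locations,"
        f"Datastreams($expand=ObservedProperty,Sensor,Observations($top={top}))"
    )
    return f"{base_url}/Things?$filter={quote(name_filter)}&$expand={expand}"


url = build_sta_things_url(
    "https://labs.waterdata.usgs.gov/sta/v1.1",
    ["USGS-02085000", "USGS-0209734440"],
)
```

Any client that wants the WaterML2-like bundle has to know this exact `$expand` shape, which is the predictability problem in a nutshell.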

KathiSchleidt commented 2 months ago

@ksonda trying to understand the 2 ends of your chain:

What's the difference between your 2 versions? To my view, they both look like simplified data exchange. I don't see how either version provides access to relevant observational metadata, e.g. ObservingProcedure, Observer...

TSML foresees 2 types of timeseries encodings:

ksonda commented 2 months ago

I think we agree that there can be two JSON encodings of WaterML2/TSML, one that is aligned with some kind of coverage data model for simple use cases and one that has ObsProc, platform/host/samplingfeature, observer/sensor, result-level metadata, etc. for analytical ones.

I see the latter as being developed totally independent of STA rather than tied to STA.

The issue is that if we want a TSML JSON flavor to include all the more detailed metadata and be a predictable, valid resultFormat in STA without missing values in nominally required fields, it would require some pretty specific query patterns in STA. EDR's query patterns are constrained enough, but the outputs flexible enough, that all you need is to declare the schema of the JSON response in the API document.

e.g.

An implementer of an EDR that is wrapping an underlying STA endpoint would have no trouble delivering this document if there is a specification of the result format and it's declared as an available format in the OpenAPI document. It's just configuring the STA query within the EDR implementation, just like you'd have to write SQL queries directly if the EDR connected directly to a SQL database.

Just trying to deliver a "full" WaterML2/TSML document via an STA encoding could lead to situations like,
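As a sketch of that configuration step, the function below turns EDR-style inputs (a location id, a closed `start/end` datetime interval, and `parameter-name` values) into fixed STA query options. The function name is made up, and matching stations on `name` and parameters on `ObservedProperty/name` are assumptions about how the underlying STA instance is populated. Open-ended intervals and single instants are not handled.

```python
def edr_location_to_sta_query(location_id, datetime_interval=None, parameter_names=None):
    """Translate EDR-style inputs into unencoded STA query options."""
    obs_opts = ["$orderby=phenomenonTime asc"]
    if datetime_interval:
        # Only a closed "start/end" interval is supported in this sketch.
        start, end = datetime_interval.split("/")
        obs_opts.append(
            f"$filter=phenomenonTime ge {start} and phenomenonTime le {end}"
        )
    ds_opts = [
        "$expand=ObservedProperty,Sensor,Observations(" + ";".join(obs_opts) + ")"
    ]
    if parameter_names:
        names = " or ".join(
            f"ObservedProperty/name eq '{p}'" for p in parameter_names
        )
        ds_opts.insert(0, f"$filter={names}")
    return {
        "$filter": f"name eq '{location_id}'",
        "$expand": "Locations,Datastreams(" + ";".join(ds_opts) + ")",
    }


query = edr_location_to_sta_query(
    "USGS-02085000",
    "2024-01-01T00:00:00Z/2024-01-02T00:00:00Z",
    ["discharge"],
)
```

The point is that the EDR side always issues the same `$expand` shape, so the response always has what the declared output schema needs.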

example.com/sta/v1.1/Things(1)?$expand=Datastreams($expand=Observations)&$resultFormat=newflavorJSON

How should this be handled? Will the STA server deliver the geometry of the Location of the Thing even if it wasn't asked for explicitly? Or the ObsProc information that is presumably supplied in the related Sensor entities? Will there be an error message? Will neither be supplied, and it's up to the user to always specify the exact combination of information they want?

ksonda commented 2 months ago

I'm sorry, I think I introduced too much about particular APIs like STA and EDR.

I think we're in agreement that there can be a coverage/domainRange type "webby" format and a more detailed format. Both can be pursued in parallel. Both are agnostic to underlying APIs. The above discussion about STA/EDR should be taken up once we get to the point that we're piloting something about interoperability between different data providers. I don't think STA/EDR should inform the design of either format.

KathiSchleidt commented 2 months ago

@ksonda On your statement on 2 encodings, one aligned with Coverage, one with the concepts from OMS, if you take the time to look at the TSML model, you'll see that it's always had 2 approaches, Coverage and TimeValue Pairs. However, as TSML stands, they both link to O&M Observations for the O&M Concepts.

While I understand your logic in trying to integrate all concepts from OMS into CovJSON, this would require:

The approach I've been working on with Paul Hershberg foresees the following encodings (with links between the simple and complex views, allowing a user to switch between the approaches depending on what data they require):

However, as we don't want to be tightly bound to encodings, it looks like we'll first be creating a logical model that takes some complexity out of the conceptual model (e.g. going to soft typing, thus avoiding all the specializations you see in the conceptual model). Then we can figure out what concrete encodings and APIs we define.

Admittedly, STA requests can become complex (with great power comes great responsibility! ;) ), probably valuable to provide some standard requests. The most dependable STA I have available is the one with EU Air Quality, there you can get everything you need including location with the following request:

https://airquality-frost.k8s.ilt-dmz.iosb.fraunhofer.de/v1.1/Things(1)?$select=name&$expand=Locations($select=location),%20Datastreams($select=description,%20unitOfMeasurement;$expand=ObservedProperty($select=name),%20Observations($select=phenomenonTime,%20result;%20$orderby=phenomenonTime%20desc;$top=10),%20Sensor($select=name))

What are you missing?

ksonda commented 2 months ago

I'm not trying to link all concepts in OMS to covJSON anymore. The EDR spec says you can provide any encoding you want, as long as you specify what it is in the API description document. I somewhat agree in principle with this:

- CovJSON/EDR for the simple webby view

- STA (or connected-systems, OGC API - SOSA) for the complex view

But the very nature of the second bullet point to me suggests that a "complex view" encoding needs to be specified agnostic to EDR, STA/ CS/ OA-SOSA (or anything else). For that matter, I don't think the "webby" view needs to be dependent on EDR. covJSON, with some minimal best practice addendum that allows for station name/location metadata, would be enough for the webby view.

If I am some international or national entity trying to aggregate "complex view" streamgage data from > 5 or so subnational data providers, it's going to be more feasible for me to ask that each provider set up an arbitrary mechanism that makes sense for them, to give me a consistent encoding that is supported by multiple open API standards (and possible to be implemented by vendors' APIs, or just custom workflows that host documents in some web-available folder), than for me to ask that each entity set up an STA endpoint or TBD OA-SOSA endpoint.

I think it's fine to define a logical model ahead of an encoding, but don't want to lose sight of the fact that many clients and workflows would expect a standard encoding or at least an encoding that is well specified in an API definition document.

I agree that in general, any number of STA queries would give the information one might expect in a "complex view". I just want the "complex view", in a consistent encoding, to be able to be made available from STA, and EDR, and custom APIs, and non-API modes of data exchange.

I'm not missing anything content wise from that STA request (supposing the provider has all the metadata in there). I just think it's putting the cart before the horse. I want there to be a standard encoding (that is JSON but not necessarily covJSON) such that STA/request?$resultFormat=complexview-tsml-json and EDR/collections/whatever/query?f=complexview-tsml-json and someone just putting daily station summary documents in an s3 bucket directory called /stations/complex-tsml/ with the file name station1-2024-01-01.json would provide documents that could be consumed by the exact same client function.
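To illustrate the "exact same client function" idea: the sketch below parses a document without caring where it came from. The document layout (`station`, `times`, `values`) is entirely hypothetical, since no such encoding exists yet; only the source-independence is the point.

```python
import json

# Hypothetical top-level members of a "complex view" document.
REQUIRED_MEMBERS = ("station", "times", "values")


def read_complex_view(text):
    """Parse one document, whatever service or bucket it was fetched from."""
    doc = json.loads(text)
    missing = [key for key in REQUIRED_MEMBERS if key not in doc]
    if missing:
        raise ValueError(f"missing required members: {missing}")
    if len(doc["times"]) != len(doc["values"]):
        raise ValueError("times and values must be the same length")
    return doc["station"]["id"], list(zip(doc["times"], doc["values"]))


# The same bytes could arrive via STA, EDR, or a plain file in a bucket.
payload = json.dumps({
    "station": {"id": "station1"},
    "times": ["2024-01-01T00:00:00Z", "2024-01-01T01:00:00Z"],
    "values": [1.2, 1.3],
})
station_id, pairs = read_complex_view(payload)
```

Nothing in `read_complex_view` refers to `nextLink`, `@iot.id`, or any other API-specific decoration, which is what makes it reusable across all three delivery paths.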

The problem is more that, as you said, you'd need that to be a "standard request" to ensure that such a requested resultFormat from STA would actually give you all that info. And it's an open question how you would document, in a machine-readable way, that this STA endpoint provides "complexview-tsml-json": simply specify that as the resultFormat and ensure that your query is of the form "https://airquality-frost.k8s.ilt-dmz.iosb.fraunhofer.de/v1.1/Things(1)?$select=name&$expand=Locations($select=location),%20Datastreams($select=description,%20unitOfMeasurement;$expand=ObservedProperty($select=name),%20Observations($select=phenomenonTime,%20result;%20$orderby=phenomenonTime%20desc;$top=10),%20Sensor($select=name))".

sgrellet commented 2 months ago

STA/request?$resultFormat=complexview-tsml-json and EDR/collections/whatever/query?f=complexview-tsml-json and someone just putting daily station summary documents in an s3 bucket directory called /stations/complex-tsml/ with the file name station1-2024-01-01.json would provide documents that could be consumed by the exact same client function.

Which I interpret as: with no element specific to the API providing it. This is something we mentioned several times as a target in OMS webconfs.

Still, we haven't made progress on it. Why? Because it's super hard to get rid of all the APIs' specific « decorations »?

hylkevds commented 2 months ago

I would say that what is described here is actually a third option next to the "Webby" and "Complex" views: a standardised export format.

One problem is to define what needs to go into this format, and how one specifies which sub-selection of all data in a service one is interested in, or what the default "selection" is. Since there is no possibility to add nextLinks or other forms of pagination, there has to be some other way to keep the files at a manageable size.

The second problem is that these will, by their nature, be quite verbose and contain much duplicate data, since each file will have to be self-contained.
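A small sketch of one way to keep such files manageable, assuming a fixed one-file-per-station-per-day split (following the `station1-2024-01-01.json` naming used earlier in the thread). The document layout is hypothetical, and the timestamps are assumed to be ISO 8601 strings in UTC so the first ten characters are the date.

```python
from collections import defaultdict


def split_by_day(station, observations):
    """Group (timestamp, value) pairs into one self-contained document per day."""
    by_day = defaultdict(list)
    for timestamp, value in observations:
        by_day[timestamp[:10]].append((timestamp, value))
    return {
        f"{station['id']}-{day}.json": {
            # Station metadata is repeated in every file so each one stands alone.
            "station": station,
            "observations": [{"time": t, "value": v} for t, v in rows],
        }
        for day, rows in sorted(by_day.items())
    }


files = split_by_day(
    {"id": "station1", "name": "example station"},
    [
        ("2024-01-01T00:00:00Z", 1.2),
        ("2024-01-01T12:00:00Z", 1.4),
        ("2024-01-02T00:00:00Z", 1.1),
    ],
)
```

This shows both problems at once: the split rule has to be agreed on in advance because there is no pagination, and the station block is duplicated in every file.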

I do think that TSML is a very good candidate for this. As would be a potential JSON Encoding for OMS.

unep-gwdc commented 2 months ago

+1 on standardized (export) format. I think the separation of encoding and API is key to enable publication of hydrological observation data either without having to set up one/several standard web services or re-using existing, non-standard web services (which will be the majority of data providers in the foreseeable future).

With respect to WIS2 and WIS2Box, the idea is that the National Met Service runs the WIS2box node and the data provider, e.g. National Hydrological Service would provide file-based metadata and observation data that is then translated into the standard-compliant metadata and observation data file(s).

WIS2 uses GeoJSON discovery metadata at the dataset-level (using WCMP 2, https://wmo-im.github.io/wcmp2/standard/wcmp2-DRAFT.html) linking to the data, either served as files from a WAF or through APIs. For the APIs there are templated links (https://wmo-im.github.io/wcmp2/standard/wcmp2-DRAFT.html#_1_19_2_templated_links) that can be used to guide data access.
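As a rough illustration of how a client might use a templated link, the sketch below fills `{name}` placeholders in a URL template. The template, the variable names, and the example.org host are all made up; the exact link members WCMP 2 defines are not reproduced here, so check the linked section for the real structure.

```python
import re
from urllib.parse import quote


def expand_template(template, variables):
    """Fill simple {name} placeholders, percent-encoding each value."""
    def substitute(match):
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"no value supplied for template variable {name!r}")
        return quote(str(variables[name]), safe="")

    return re.sub(r"\{([A-Za-z0-9_]+)\}", substitute, template)


# Hypothetical template and values, for illustration only.
data_url = expand_template(
    "https://example.org/collections/hydro/locations/{station_id}?f={format}",
    {"station_id": "USGS-02085000", "format": "covjson"},
)
```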

Not sure whether this would cover @ksonda thoughts on separation and documentation but could serve as a starting point.

dblodgett-usgs commented 2 months ago

We have two things floating around in this thread -- one that I think is not being stated outright and want to clarify.

What we are saying out loud: The distinction between Web use cases, where convenience for common Web use cases is paramount, and data exchange use cases, where precision of observation documentation is the primary (but still not dogmatically important) objective. At the end of the day, there is an adoption dynamic that we must confront. The Web will not wait for the standards community, but data exchange has requirements that necessitate more care in standardization.

What is discussed above without naming the design consideration is the distinction between APIs that allow you to define your document structure via API constructs and those that provide domain filtering only.

The pattern of tight coupling of APIs to payload is super useful in some contexts (see the explosion of graphql and odata) but creates considerable hurdles when trying to pursue broad interoperability at the community (e.g. hydroscience) and cross-community (e.g. emergency response) levels. As Kyle illustrated, it's unrealistic to expect every hydro-met service to adopt the same coupled API/payload pattern in pursuit of community-level interoperability.

The pattern of decoupling content from API is limiting in that you get what you get and can't (without anti-patterns or overloads) restrict the content returned. It's a fair critique -- but the separation of concerns and resulting architectural freedom it creates is worth every penny of sacrifice.

<minor diversion>

We can still offer "lightweight" documents through extended format lists (e.g. f=tsml-json for the whole shebang or f=tsml-json-result for only the result), which strikes me as a utilitarian anti-pattern worth considering... or overloads akin to "vendor parameters" that allow you to mutate the response format in a non-standard and non-invasive way.

</minor diversion>
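For what it's worth, the extended format list from the diversion above could be as small as this on the server side. It is only a sketch: the `metadata`/`result` split of the document is hypothetical, and the two format names are the ones floated above, not registered media types.

```python
def render(document, f="tsml-json"):
    """Return the whole document or only its result block, depending on f."""
    if f == "tsml-json":
        return document
    if f == "tsml-json-result":
        # Lightweight variant: drop everything except the result.
        return {"result": document["result"]}
    raise ValueError(f"unsupported format: {f}")


# Hypothetical document shape, for illustration only.
full_document = {
    "metadata": {"station": "station1", "procedure": "example method"},
    "result": [["2024-01-01T00:00:00Z", 1.2]],
}
light = render(full_document, f="tsml-json-result")
```

The API still only filters the domain; the client picks among a small fixed set of document shapes rather than composing its own.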

This is all to say that it is critical that we seek out an arrangement where the resource model we define as our interoperability target is not coupled to the Web API that we are going to use to filter the parameter and spatiotemporal range of data. By extension, that interoperability target needs to be holistic. Within that, if we are going to capture Web use cases, we can't load up a document with stuff that will slow down or confuse Web developers and their applications (Webby) and we also need to define a way for those who need a fuller picture to share and / or access that picture in a precise and complete form.

There's been quite a lot of piling on to this thread -- Can someone suggest another venue to continue this discussion? Perhaps the TSML github?

We should yield the floor to @tomkralidis and @webb-ben to continue as was laid out here: https://github.com/wmo-im/wis2box/issues/703#issuecomment-2299544522