initial WIS 2.0 metadata/search brainstorming/ideas

wmo-im / wis2-metadata-search

WIS2 Metadata Search

https://community.wmo.int/activity-areas/wmo-information-system-wis

MIT License

1 stars 1 forks source link

initial WIS 2.0 metadata/search brainstorming/ideas #1

Open tomkralidis opened 3 years ago

tomkralidis commented 3 years ago

@wmo-im/tt-wismd / @wmo-im/tt-wigosmd in relation to WIS 2.0 and the metadata search demonstration project, notes from initial discussion with discussion with @6a6d74 (2020-12-15).

Note that these are initial ideas only for discussion with ET-Metadata. Please review and provide your thoughts and perspectives here, thanks.

Drivers

lower the barrier to entry
FAIR data principles
Web architecture/hypermedia
webby/of the web
search engine friendly

Metadata Standards

WIS and WIGOS metadata
- linkage between dataset and the platform the generated/collected the data
- a discovery metadata record should be able to reference a WIGOS metadata record (in OSCAR)
DCAT2: dataset+multiple realizations
- unique identifiers are first class
- consider community standards

Harvesting

suppliers provide URLs to metadata
harvest a set of metadata terms out of that that record, from a set of known formats (adapter pattern)
core tooling for data providers for converting their bespoke metadata into recognized formats if needed
- data providers can contribute their converter to tooling (core+extension/plugin)
what is the machinery to harvest/push/pull records to a GISC destination

Catalogue options

The browser as the catalogue

is the browser search engine
WIS catalogue is NOT a primary search endpoint
probably doesn't need duplicated in each GISC
harvest from closest point to authoritative source
Structured data
- e.g. Google Dataset search
schema.org annotations

Definitive WIS catalogue

People don't trust search engines
provide a vanilla search experience without "value add" from search engines to prioritize or promote various things
to assert the definitive list [authoritative data] as recognized by WMO
- approved by PRs
- quality statement
- use this 'quality statement' identify quality / authority of datasets; enable search engines to see what is official
Searching from applications (e.g. GIS Desktop, QGIS, ArcGIS)
- sensible for WIS Catalogue to provide an API
- need to consider performance/availability
Metadata in the WIS Catalogue
- WIS Catalogue only holds the smallest amount of metadata needed
- refer back to the original metadata for the full description
- meta-metadata, with link back to full metadata record
- example:
  - identifier
  - type
  - title
  - abstract
  - keywords
  - extents
  - links
  - license
  - provenance
  - schema.org annotations
availability/uptime considerations
- operational? 24x7?
- number of instances? Synchronization? Or harvest metadata direct from source

Guidance and support to members

needed for NCs and DCPCs to do this to make their data searchable on Google, for example
- e.g. publishing a schema.org record
- tools for transformation/migration from WCMP

tomkralidis commented 3 years ago

Further discussion with @efucile (2021-02-15) (cc @petersilva)

current WIS catalogue full of real-time traffic/metadata
very granular
we need to have higher level of granularity and allow for actionable links via PubSub for clients to work with lower granules
we need to align to WIS 2.0 topics as a means to:
- define vocabulary/themes/concepts/terms/schemes in metadata (see ~~https://github.com/petersilva/GTStoWIS2#conventions~~ https://github.com/wmo-im/GTStoWIS2#conventions)
- provide as queryables for search

petersilva commented 3 years ago

https://github.com/wmo-im/GTStoWIS2#conventions better to get the shared repo, than my personal repo.

as discussed with @tomkralidis: The tables from WMO 386 Attachment II-5 are in the GTStoWIS2 folder in JSON format, and are chained together. Somebody should be able to string the tables together to produce one big table of all possible topics, but I remember @antje-s doing something akin to that, but it resulted in impractically large tables. I think it would have to be done with a keen appreciatiation for how all the tables link together, it is perhaps not so large then.

petersilva commented 3 years ago

I just went onto my server behind my experimental prototype ( https://hpfx.collab.science.gc.ca/~pas037/WMO_Sketch ) and did:


pas037@hpfx1:~/public_html/WMO_Sketch/20210215T08$ find WIS -type d  | wc -l
8929
pas037@hpfx1:~/public_html/WMO_Sketch/20210215T08$

for most countries, the hierarchy in a given hour is relatively simple... Here is what the topic hierarchy for Italy at 8Z looks like on my prototype (using the GTStoWIS2 module from the repo):


WIS/it
WIS/it/roma_met_com_centre
WIS/it/roma_met_com_centre/surface
WIS/it/roma_met_com_centre/surface/aviation
WIS/it/roma_met_com_centre/surface/aviation/metar
WIS/it/roma_met_com_centre/surface/aviation/metar/it
WIS/it/roma_met_com_centre/surface/aviation/speci
WIS/it/roma_met_com_centre/surface/aviation/speci/it
WIS/it/roma_met_com_centre/observation
WIS/it/roma_met_com_centre/observation/surface
WIS/it/roma_met_com_centre/observation/surface/land
WIS/it/roma_met_com_centre/observation/surface/land/fixed
WIS/it/roma_met_com_centre/observation/surface/land/fixed/synop
WIS/it/roma_met_com_centre/observation/surface/land/fixed/synop/non-standard
WIS/it/roma_met_com_centre/observation/surface/land/fixed/synop/non-standard/0-90n
WIS/it/roma_met_com_centre/observation/surface/land/fixed/synop/non-standard/0-90n/90e-0
WIS/it/roma_met_com_centre/forecast
WIS/it/roma_met_com_centre/forecast/aviation
WIS/it/roma_met_com_centre/forecast/aviation/taf
WIS/it/roma_met_com_centre/forecast/aviation/taf/under12hours
WIS/it/roma_met_com_centre/forecast/aviation/taf/under12hours/it

My prototype feed is from the UNIDATA, so heavily biased with US data. when I look at the tree:

pas037@hpfx1:~/public_html/WMO_Sketch/20210215T08$ find WIS -type d  | grep WIS/us | wc -l
5967

There is a lot of:


WIS/us/KWEG/model-regional/cloud/0-90n/0-90w/06h
WIS/us/KWEG/model-regional/cloud/0-90n/0-90w/12h
WIS/us/KWEG/model-regional/cloud/0-90n/0-90w/24h
WIS/us/KWEG/model-regional/cloud/0-90n/0-90w/03h
WIS/us/KWEG/model-regional/cloud/0-90n/0-90w/27h

WIS/us/KWEC/model-regional/wave/0-90n/0-90w
WIS/us/KWEC/model-regional/wave/0-90n/0-90w/24h
WIS/us/KWEC/model-regional/wave/0-90n/0-90w/27h
WIS/us/KWEC/model-regional/wave/0-90n/0-90w/09h
WIS/us/KWEC/model-regional/wave/0-90n/0-90w/12h
WIS/us/KWEC/model-regional/wave/0-90n/0-90w/18h
WIS/us/KWEC/model-regional/wave/0-90n/0-90w/21h

One of the things we were debating is whether it makes sense to have the prediction hour in the topic hierarchy. Is it really that often that people will want only the 21h forecast from a given product? We were thinking that we could eliminate a lot of topics, if we just remove the prediction hour from the topic tree. ( https://github.com/wmo-im/GTStoWIS2/issues/5 ) It could shrink the tree, putting all forecast hours under the geographical topic, with the hour to differentiate them being left to the file name.

I think this is more practical, but it runs counter to the idea of metadata being "very granular" ... I think the temporal information is too granular for inclusion in the topic tree, but would appreciate other views.

josusky commented 3 years ago

I agree with removal of the forecast hour from the tree. It seems more practical to leave possible forecast hour filtering, if needed, at the client's discretion. In fact, NWP model data are a good candidate for distribution through a service. Once this happens, then instead of sending 100s (1000s) of notification for each model run (about 100s/1000s of new files) your system will send one notification telling that the service has a new run available. For now, those who want just a subset of forecast hours will discard 100s/1000s of small notifications few times per day.

petersilva commented 3 years ago

note... the number of notifications is not changed... it is just that all of the outputs will be under the same topic, with different file names. You will subscribe to the KWEC/model-regional/0-90n/0-90w (aka atlantic ocean above the equatorish...) and there will be a file for each hour published under the same topic. For what it is worth, in operational forecasting, the 6 hour forecast is available before the 12 hour, they 18hour etc... so one announcement for the entire run would be unsuitable for real-time use, as it could delay transmission for ... usually upto an hour or so (I don't know about other countries, but in Canada, the "regional" (adaptive grid over North America) run is about 45 minutes long, the global (analogous to ECMWF guidance) is around 90 minutes long.) There are also more localized grids that have similar performance profiles to the regional (aka HRDPS.)

josusky commented 3 years ago

Sorry, I did not check how this particular model is distributed. If its one file per forecast hour, then it is fine. My point was that some models produce hundreds of files per run. It is not a problem to filter out such number of notifications on client, but still it would nicer if the service just sent a notification when some new logical set of data becomes available. But we have diverged. The case with forecast hour can be declared closed.

josusky commented 3 years ago

When it concerns granularity mentioned by @tomkralidis. In WIS 1.0 we have one metadata records per GTS bulletin, but many of them are logically from the same category - e.g. surface observation from a certain part of the world. So in the proposed topic hierarchy they will naturally form sub-trees and so the subscribing will be easier.

petersilva commented 3 years ago

I think this is intimately related to: https://github.com/wmo-im/GTStoWIS2/issues/9

tomkralidis commented 3 years ago

thanks @josusky. IMO we want a higher level of granularity so the WIS 2.0 catalogue does not become a bulletin search API, but a yellow pages so one can find/bind accordingly.

tomkralidis commented 3 years ago

as discussed with @tomkralidis: The tables from WMO 386 Attachment II-5 are in the GTStoWIS2 folder in JSON format, and are chained together. Somebody should be able to string the tables together to produce one big table of all possible topics, but I remember @antje-s doing something akin to that, but it resulted in impractically large tables. I think it would have to be done with a keen appreciatiation for how all the tables link together, it is perhaps not so large then.

If generating a 'supertable' is too large, can we describe the tables in question (C1, C2, C3, C6, C7, etc.) and their relationship? Perhaps this is described at https://github.com/wmo-im/GTStoWIS2#conventions ?

petersilva commented 3 years ago

Summary of the table linkages from WMO 386 Volume I Attachment II-5:

Table A : Data type designator T1 Matrix Table for T2A1A2ii definitions
Table B1 : Data type designator T2 (when T1 = A, C, F, N, S, T, U or W)
Table B2 : Data type designator T2 when T1 = D, G, H, X or Y)
Table B3 : Data type designator T2 (when T1 = I or J)
Table B4 : Data type designator T2 (when T1 = O)
Table B5 : Data type designator T2 (when T1 = E)
Table B6 : Data type designator T2 (when T1 = P, Q)
Table C1 : Geographical designators A1A2 for use in abbreviated headings T1T2A1A2ii CCCC YYGGgg
for bulletins containing meteorological information, excluding ships’ weather reports
and oceanographic data
Table C2 : Geographical designators A1A2 for use in abbreviated headings T1T2A1A2ii CCCC YYGGgg
for bulletins containing ships’ weather reports and oceanographic data including
reports from automatic marine stations
Table C3 : Geographical area designator A1 (when T1 = D, G, H, O, P, Q, T, X or Y) and geographical
area designator A2 (when T1 = I or J)
Table C4 : Reference time designator A2 (when T1 = D, G, H, J, O, P or T)
Table C5 : Reference time designator A2 (when T1 = Q, X or Y)
Table C6 : Data type designator A1 (when T1 = I or J)
Table C7 : Data type designator T2 and A1 (when T1 = K)
Table D1 : Level designator ii (when T1 = O)
Table D2 : Level designator ii (when T1 = D, G, H, J, P, Q, X or Y)
Table D3 : Level designator ii (when T1T2 = FA or UA)

If we drop hours, then Tables C4, and C5 disappear, how big is a supertablr? In the GTS2WIS module, @antje-s has already merged all of the table B's into one TableB that is about 400 lines.. or so.

TableA could be merged into TableB for about 4*26=104 entries... so about 504 for a hypothetical TableAB. C1 shows up 11 times in TableA, and has around 300 entries. ... so the table would add 33000 lines if C1 were included. We don't currently use C2,... weird... might be a gap. C3 has 28 entries and is present 11 times, so 308 entries. C6 has 121 entries and shows up only twice, so 242 entries. C7 has 91 entries and shows up only once.

so the total of a single recursive JSON array merging all the tables into one big one is: 504+33000+308+242+91=34135 round it off to 35000. A bit much for humans to understand, but you could just read all the existing table data into one big in memory thing... TableTTAAii.json if you like.

Then there are 6000 known origin codes (CCCC) of 15K known airports... that could originate such products in theory. ends up in the millions if you go there, so I guess we stop with just TTAAii with one table, and a second table for CCCC. The origin code maps to the first two levels of hierarhcy (Country/Centre) and the TTAAii stuff is the rest of it.

tomkralidis commented 3 years ago

35K is tractable (this is the size of NASA GCMD for example). Can we have workflow that autogen's the supertable from the smaller tables (I suppose easier to manage that way as well)?

petersilva commented 3 years ago

I made the code to do this in issue009 branch on GTStoWIS2. You can clone and reproduce it... it's around 277KB (only 17000 entries in the end... some math might have been wrong) with the tables in their current state. I had to add D1 and D2 tables, which were missing. Also there are some cases where there is a comparison to do (ii < 49, for example) where only the threshold is included... so it might be wrong for those cases. Unclear to me how it can be used for now.

antje-s commented 3 years ago

Some comments on the ideas from above...

_Metadata Standards

WIS and WIGOS metadata linkage between dataset and the platform the generated/collected the data_ --> transfer link should be enough (background architecture differs greatly) a discovery metadata record should be able to reference a WIGOS metadata record (in OSCAR) --> WIGOS-IDs are included in metadata as place keyword example gmd:descriptiveKeywords uuidref="place"/gmd:MD_Keywords/gmd:keyword/gco:CharacterString 0-20000-0-44203, Rinchinlhumbe [ http://oscar.wmo.int/OSCAR/wigosid=0-20000-0-44203] ... --> topic value could be part of the metadata record, e.g. under transfer options

_Catalogue options The browser as the catalogue

WIS catalogue is NOT a primary search endpoint_ --> if you do without your own search function, you also take away the possibility of your own search filter conversions, search result displays, search variants. this should be considered
schema.org annotations --> good, if you connect to the commercial search engines, you can also search there limited to your own website, e.g. www.google.de search for "meteogramm site:gisc.dwd.de"

_Definitive WIS catalogue

Metadata in the WIS Catalogue -- WIS Catalogue only holds the smallest amount of metadata needed -- refer back to the original metadata for the full description -- meta-metadata, with link back to full metadata record_ --> new concept, could be a solution to reduce size of single metadata record, but not solving the problem of granularity of the metadata, a grouping of the product metadata would be implemented by metadata on services, but there could also be different metadata for similar services, so an additional grouping of the metadata might be helpful (e.g. wcs-services, messaging services, data retrieval,...) --> linking metadata to NC / DCPC would mean each NC (so e.g. also the weather service of Burkina Faso) would have to operate a catalog