xcube-dev / xcube

xcube is a Python package for generating and exploiting data cubes powered by xarray, dask, and zarr.
https://xcube.readthedocs.io/
MIT License

xcube insitu, new cli #277

Open rabaneda opened 4 years ago

rabaneda commented 4 years ago

Is your feature request related to a problem? Please describe.

xcube and xcube viewer were meant to be tools for satellite cubes. But it seems there is more and more interest in visualizing in-situ data with xcube viewer. Hence, it would be a good idea to add a CLI in xcube to create cubes from in-situ data.

In-situ data can come in many different formats, so some pre-processing will always be necessary. But once the data is arranged into one or more csv files, there should be a tool in xcube to create the cubes.

Describe the solution you'd like

Assuming the user has pre-processed the in-situ data, we need to set the requirements for the csv file(s) to ingest: for example, time stamp, lon, lat (and their formats), plus variables. The csv file(s) will always be accompanied by a config file with dataset attributes and variable attributes (metadata). The config file can also hold parameters needed for ingestion (such as path, resolution, etc.).

xcube insitu needs a method to verify the csv files and the config file. At a minimum, the presence of a config file must be compulsory.
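For illustration, a minimal sketch of what such a config and a csv check could look like (the field names, required columns, and the function are all hypothetical, not an existing xcube API):

```python
import os
import pandas as pd

# Hypothetical required columns for the csv file(s)
REQUIRED_COLUMNS = {"time", "lon", "lat"}

# Hypothetical config layout: global attributes, per-variable
# attributes, and ingestion parameters such as resolution.
EXAMPLE_CONFIG = {
    "attrs": {"title": "Buoy measurements", "source": "in-situ"},
    "variable_attrs": {
        "swh": {"units": "m", "long_name": "significant wave height"},
    },
    "resolution": 10.0,  # metres, the proposed default
}

def verify_inputs(csv_dir: str, config: dict) -> None:
    """Check that a config is present and each csv has the required columns."""
    if not config:
        raise ValueError("a config file is compulsory")
    for name in os.listdir(csv_dir):
        if not name.endswith(".csv"):
            continue
        # Read only the header row to inspect the columns
        columns = set(pd.read_csv(os.path.join(csv_dir, name), nrows=0).columns)
        missing = REQUIRED_COLUMNS - columns
        if missing:
            raise ValueError(f"{name} is missing columns: {sorted(missing)}")
```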

In-situ data can be arranged in one or multiple csv files. It will normally be data at one point (location) over time, but it could also cover an area. This is not a problem, since the csv file will include lon and lat (except for resolution issues). However, we may want to include multiple points (where in-situ data was collected) in the same cube. Hence, it could be a good idea to have a different csv file for each point (mainly if the in-situ data comes from different kinds of devices). So, the input for xcube insitu will be a folder containing one or multiple csv files.

Then it will be necessary to re-chunk. Chunks should cover the whole time dimension and small areas (lat/lon). The reason for this is to prune effectively afterwards and eliminate NaNs.
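As a rough sketch of that chunking with xarray/dask (the path and chunk sizes are just placeholders):

```python
import xarray as xr

# Open the (mostly empty) in-situ cube and re-chunk it so that each
# chunk spans the full time axis but only a small lat/lon window.
cube = xr.open_zarr("insitu_cube.zarr")  # hypothetical path
cube = cube.chunk({"time": -1, "lat": 16, "lon": 16})

# Chunks whose lat/lon window contains no observations are all-NaN
# and could then be pruned from the store.
```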

Describe alternatives you've considered

About the resolution: 10 metres will be the default, but this could be changed, since some in-situ data covers an area instead of a point (like lidars, wave radars, ...).

Development

I don't know if someone is already working on this. I'm happy to develop it myself. Nevertheless, it would be great if more people added requirements for these csv files.

forman commented 4 years ago

Have you had a look at the xcube extract CLI tool?

forman commented 4 years ago
(xcube) $ xcube extract --help
Usage: xcube extract [OPTIONS] CUBE POINTS

  Extract cube points.

  Extracts data cells from CUBE at coordinates given in each POINTS record
  and writes the resulting values to given output path and format.

  POINTS must be a CSV file that provides at least the columns "lon", "lat",
  and "time". The "lon" and "lat" columns provide a point's location in
  decimal degrees. The "time" column provides a point's date or date-time.
  Its format should preferably be ISO, but other formats may work as well.

Options:
  -o, --output OUTPUT  Output path. If omitted, output is written to stdout.
  -f, --format FORMAT  Output format. Currently, only 'csv' is supported.
  -C, --coords         Include cube cell coordinates in output.
  -B, --bounds         Include cube cell coordinate boundaries (if any) in
                       output.
  -I, --indexes        Include cube cell indexes in output.
  -R, --refs           Include point values as reference in output.
  --help               Show this message and exit.

...and please have a look into module xcube.core.extract.
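Based on the help text above, a rough sketch of using the module from Python (the function name and signature in xcube.core.extract are assumed here from the module's purpose; please verify against the source, and all file names are placeholders):

```python
import pandas as pd
import xarray as xr

# Assumed to live in xcube.core.extract; verify the exact name there.
from xcube.core.extract import get_cube_values_for_points

# POINTS: a csv with at least "lon", "lat", and "time" columns,
# as described in the CLI help above.
points = pd.read_csv("points.csv", parse_dates=["time"])

cube = xr.open_zarr("cube.zarr")

# Extract the cube values at the given point coordinates.
# CLI equivalent: xcube extract -o values.csv cube.zarr points.csv
values = get_cube_values_for_points(cube, points)
print(values)
```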

rabaneda commented 4 years ago

After having a look at xcube extract, I think that xcube extract does the opposite of what I would like to do.

For xcube extract: the input is a cube (zarr, nc, mem) and the output is a csv file (I guess it could also be a cube).

For the proposed xcube insitu: the input is one or more csv files and the output is a cube (zarr, nc, mem).

xcube extract uses a csv file as input too, but just as a POINTS record; there is no variable data in that csv file to ingest. (Besides, I think the column 'time' should be optional. When 'time' is not given, the output would be the whole time series of the variable(s) at a certain lat/lon. Otherwise, we'll need to create a long (lat, lon, time) point record, because we'll need a point for each time step. I'll open another issue for this if, after a deeper look, I think it's necessary.)

My thoughts still are that there is no CLI tool to ingest in-situ data and convert it to a cube.
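To make that direction concrete, a minimal sketch of csv-to-cube under these assumptions (column names and paths are hypothetical):

```python
import pandas as pd

# One csv per in-situ station, with "time", "lat", "lon" plus variables.
df = pd.read_csv("buoy_1.csv", parse_dates=["time"])

# Index by the cube dimensions and let xarray build the N-D array;
# coordinate combinations without an observation become NaN, which is
# why the resulting cube is mostly empty and needs pruning.
ds = df.set_index(["time", "lat", "lon"]).to_xarray()

ds.to_zarr("insitu_cube.zarr")
```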

forman commented 4 years ago

Why do you want the in-situ data to reside in an N-D data array, of which almost all cells are empty?

rabaneda commented 4 years ago

The point is to include the possibility of using in-situ data with xcube viewer. Not for the map, but for the time series graph and the regression graph (if included in the viewer).

Another possible way is to include an option in the viewer to read csv files and compare them with satellite data.

I'm aware that a cube where most of the cells are empty is not optimal (for that reason I suggested resampling and pruning). But there should be a tool/option to add in-situ data for comparison in xcube or xcube viewer. We can always use xcube extract and do the comparison ourselves outside the viewer, but I think there is interest within DCS4COP to add in-situ data to the viewer.

rabaneda commented 4 years ago

Almost forgot!!!!

Here you have 4 geojson files. 2 of them are geojson.Point, for a fixed in-situ sensor. The other 2 are geojson.GeometryCollection, in case it is a moving buoy. This could be changed to Feature and FeatureCollection instead of Point and GeometryCollection.

There are 2 points and 2 geometry collections because I have arranged the data differently, but always following the geojson standard. In the first format, I created new members for time and variables separately. The second format is a bit different: I created a member called "properties", which is a dictionary of dictionaries where time is the key at the first level and the variables are the keys at the second level. It is easier to understand by opening the files than by reading this paragraph.

Another point is the time format. Geojson only accepts int, float, list, tuple, dict, and strings; not numpy or pandas objects. So, I kept time as a string "dd/mm/yyyy HH:MM:SS". This could be changed to seconds since 01/01/1970, for example. In any case, the reader or parser in xcube viewer will need to know the format and transform it. insitu.zip
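A rough sketch of that transformation on the reader side, assuming the string format above:

```python
import pandas as pd

# Parse the geojson time strings ("dd/mm/yyyy HH:MM:SS") into
# datetime64 values that can be matched against cube time steps.
times = ["01/06/2019 12:00:00", "01/06/2019 13:00:00"]  # example values
parsed = pd.to_datetime(times, format="%d/%m/%Y %H:%M:%S")

# The seconds-since-epoch alternative would be:
seconds = parsed.astype("int64") // 10**9
```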

rabaneda commented 4 years ago

Here you have also the csv files from which I created the geojson files. insitu_csv.zip

rabaneda commented 4 years ago

I think I'm using this issue as a logbook for this matter. Anyway, here are my thoughts.

Right now, the geojson parser within xcube won't be capable of reading and transforming the data from the geojson files I uploaded above. I mean, it will read the data, but it will not digest it so that it can later be matched up (or plotted) with data contained in the datacubes. For the match-ups, lon, lat, and time are necessary. lon and lat won't be a problem with geojson, but time needs to be in the same format in both datasets.

Therefore, the digestion of in-situ data must definitely lead to a pandas dataframe with a multiindex (perhaps geopandas) or to multiple xarray dataarrays. So, we have 2 options:

  1. Update/create the geojson parser to automatically create this dataframe/dataarray. (Since geojson is a dictionary, we could read a geojson and change the format of time from string/int to the numpy time format. Then there would be no need for a dataframe/dataarray. But will it work for plotting time series and match-ups? Won't it be too slow for long datasets? Without a dataframe/dataarray we lose the flexibility of pandas and xarray, and we would need to write tedious code for time slicing and indexing.)

  2. Keep the parser as it is and add some code to create a dataframe/dataarray from the csv (see the sketch after this list).
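A minimal sketch of option 2, assuming csv columns named "time", "lat", and "lon" plus variables:

```python
import pandas as pd

# Read the in-situ csv and build the multiindexed dataframe that the
# match-up code would work on.
df = pd.read_csv("buoy_1.csv", parse_dates=["time"])
df = df.set_index(["time", "lat", "lon"])

# An xarray view for time slicing and indexing, if preferred:
ds = df.to_xarray()
subset = ds.sel(time=slice("2019-06-01", "2019-06-30"))
```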

In any case, under the hood we need to link the geojson to a dataframe/dataarray. Perhaps just include the name of the csv file in a new field of the geojson and then read the data on-the-fly when requested; it shouldn't take too long. If we have multiple points or a moving point, we could use a FeatureCollection where every Feature has a precise lon/lat and a csv file containing the data for that precise point. Still, we'll need to digest a csv file with in-situ data into a geojson linked to one or multiple csv files.

The new field of each Feature will contain the name/path of the csv file, the metadata of the dataset, and the metadata for each of the variables in the csv file.

(I'll try to create those geojson files to show an example)
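In the meantime, a minimal sketch of what such a Feature could look like, built as a plain Python dict (the "insitu_data" field name and all metadata keys are hypothetical):

```python
import json

feature = {
    "type": "Feature",
    "geometry": {"type": "Point", "coordinates": [4.25, 52.10]},  # lon, lat
    "properties": {
        # Hypothetical new field linking the Feature to its csv data
        "insitu_data": {
            "csv_path": "buoy_1.csv",
            "attrs": {"title": "Buoy 1", "source": "in-situ"},
            "variable_attrs": {
                "swh": {"units": "m", "long_name": "significant wave height"},
            },
        },
    },
}

print(json.dumps(feature, indent=2))
```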