Open TomNicholas opened 6 days ago
I'm not an expert on VRTs, but I think it could work. It could be useful if you want to create a dataset from overlapping rasters and the VRT represents an already deduplicated version of the data (assuming the deduplication logic is appropriate). Mostly, I'm not sure how useful this functionality would be, because I am not aware of VRTs that are made publicly available or published for general use. I have heard of VRTs being used for on-the-fly definition of mosaics.
I am also going to tag my colleagues @wildintellect and @vincentsarago who have more experience with VRTs than I do and may be able to think of reasons this may or may not work.
@abarciauskas-bgse converting a VRT to a reference file for Zarr seems fine. I'm not sure the VRT would contain all the chunk information you need, so the source files may also need to be scanned. At that point it's not very different from just being given a list of files to include in a manifest.
Example:
<VRTDataset rasterXSize="512" rasterYSize="512">
  <GeoTransform>440720.0, 60.0, 0.0, 3751320.0, 0.0, -60.0</GeoTransform>
  <VRTRasterBand dataType="Byte" band="1">
    <ColorInterp>Gray</ColorInterp>
    <SimpleSource>
      <SourceFilename relativeToVRT="1">utm.tif</SourceFilename>
      <SourceBand>1</SourceBand>
      <SrcRect xOff="0" yOff="0" xSize="512" ySize="512"/>
      <DstRect xOff="0" yOff="0" xSize="512" ySize="512"/>
    </SimpleSource>
  </VRTRasterBand>
</VRTDataset>
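To make the "list of files" point concrete, here is a minimal sketch of pulling the source files and their destination windows out of a VRT like the one above, using only Python's standard library. The function name `list_simple_sources` and the output layout are made up for illustration; it only handles `SimpleSource` elements, and real VRTs may use other source types or nest further VRTs.

```python
import xml.etree.ElementTree as ET

# The example VRT from this thread, embedded so the sketch is self-contained.
VRT_XML = """<VRTDataset rasterXSize="512" rasterYSize="512">
  <GeoTransform>440720.0, 60.0, 0.0, 3751320.0, 0.0, -60.0</GeoTransform>
  <VRTRasterBand dataType="Byte" band="1">
    <ColorInterp>Gray</ColorInterp>
    <SimpleSource>
      <SourceFilename relativeToVRT="1">utm.tif</SourceFilename>
      <SourceBand>1</SourceBand>
      <SrcRect xOff="0" yOff="0" xSize="512" ySize="512"/>
      <DstRect xOff="0" yOff="0" xSize="512" ySize="512"/>
    </SimpleSource>
  </VRTRasterBand>
</VRTDataset>"""

RECT_KEYS = ("xOff", "yOff", "xSize", "ySize")


def list_simple_sources(vrt_text):
    """Return one dict per SimpleSource (hypothetical helper, not a GDAL API)."""
    root = ET.fromstring(vrt_text)
    sources = []
    for band in root.iter("VRTRasterBand"):
        for src in band.iter("SimpleSource"):
            dst = src.find("DstRect")
            sources.append({
                "band": int(band.get("band")),
                "filename": src.findtext("SourceFilename"),
                # Rect values are parsed as floats because VRTs may hold
                # fractional offsets and sizes.
                "dst_rect": {key: float(dst.get(key)) for key in RECT_KEYS},
            })
    return sources


sources = list_simple_sources(VRT_XML)
for entry in sources:
    print(entry["band"], entry["filename"], entry["dst_rect"])
```

Note that nothing in this output says where the bytes of each chunk live inside `utm.tif`, which is why the source files would still need to be scanned.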
Fun, I didn't know about https://gdal.org/drivers/raster/vrt_multidimensional.html. Not sure I've ever seen one of these.
To be clear, a VRT does not de-duplicate anything. When using a VRT with GDAL:
"If there is some amount of spatial overlapping between files, the order of files appearing in the list of source matter: files that are listed at the end are the ones from which the content will be fetched." https://gdal.org/programs/gdalbuildvrt.html https://gdal.org/drivers/raster/vrt.html
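The ordering rule quoted above can be illustrated with a toy compositor in plain Python. This is only a sketch of the "later sources win" behaviour under simplifying assumptions: each source is a constant-valued rectangle, and nodata handling and resampling are ignored. The file names and the `composite` function are hypothetical.

```python
def composite(width, height, sources, fill=0):
    """Paint sources onto a canvas in listed order; later ones overwrite earlier ones."""
    canvas = [[fill] * width for _ in range(height)]
    for src in sources:
        for y in range(src["yOff"], src["yOff"] + src["ySize"]):
            for x in range(src["xOff"], src["xOff"] + src["xSize"]):
                canvas[y][x] = src["value"]
    return canvas


# Two hypothetical sources that overlap in column 2.
sources = [
    {"name": "a.tif", "xOff": 0, "yOff": 0, "xSize": 3, "ySize": 2, "value": 1},
    {"name": "b.tif", "xOff": 2, "yOff": 0, "xSize": 2, "ySize": 2, "value": 2},
]

mosaic = composite(4, 2, sources)
for row in mosaic:
    print(row)
# Column 2 holds values from b.tif because it is listed last.
```

Reversing the list would give the overlap to `a.tif` instead, so the result depends on file order rather than on any deduplication logic.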
So it's up to you whether you'd want a VRT, which takes effort, or would rather just be passed a list of files to include in a mosaicked reference file.
Here's a great one you can experiment with: https://github.com/scottstanie/sardem/blob/master/sardem/data/cop_global.vrt This shows nested VRTs and points to a public dataset on AWS that is a global DEM with no overlaps, 1 projection, only 1 band, and 1 time point. So in some ways the simplest possible scenario.
Interesting, thanks @wildintellect.
Thanks for clearing that up about de-duplication. I was under the impression that VRTs could represent a mosaic after deduplication of source files (e.g. spatial overlap is resolved through logic while building the VRT). But I suppose that use case would mean choosing a preference among overlapping data at the block level, not the pixel level.
Thanks for the ping @TomNicholas! Some good points have already been mentioned. I think I just brought up VRTs because they are another example of lightweight sidecar metadata that simplifies the user experience of data management :) ... I haven't thought too much about integrations with virtualizarr, but some ideas below:
I suppose in the same way you create a reference file for NetCDF/DMR++ to bypass HDF and use Zarr instead, you could do the same for TIFF/VRT to bypass GDAL. Would probably want to do some benchmarking there, because unlike hdf, GDAL is pretty good at using overviews and efficiently figuring out range requests during reads (for the common case of a VRT pointing at cloud-optimized geotiffs).
I think another connection here is: what is the serialization format for virtualizarr, and what is its scope? My understanding is the eventual goal is to save directly to ZARR v3 format, and I'm sure there are lots of existing discussions that I'm not up to speed on. But my mental model is that VRT, STAC, ZARR, KerchunkJSON are all lightweight metadata mappings that can encode many things (file and byte locations, arbitrary metadata, "on read" computations like scale and offset, subset, reprojection).
It seems these lightweight mappings work well up to a limit, and then you encounter the need for some sort of spatial index or database system :) So again, my mapping becomes (KerchunkJSON -> Parquet, VRT -> GTI, STAC -> pgSTAC, ZARR -> Earthmover?)
From https://docs.csc.fi/support/tutorials/gis/virtual-rasters/ (emphasis mine):
That sounds a lot like a set of reference files doesn't it... Maybe we could ingest those virtual raster files and turn them into chunk manifests, like we're doing with DMR++ in #113?
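As a rough illustration of what "turning a VRT into a chunk manifest" might involve, here is a sketch that maps each source's destination window onto a chunk grid. Everything here is an assumption made for illustration: the `placeholder_manifest` function, the file names, the 256x256 chunk shape, and the `path`/`offset`/`length` entry layout are not VirtualiZarr's API. The byte offsets and lengths are left as `None` because the VRT does not record them; they would have to come from scanning the source files.

```python
def chunk_index_ranges(dst_rect, chunk_shape):
    """Return (row_range, col_range) of chunk indices covered by a destination window."""
    chunk_rows, chunk_cols = chunk_shape
    x0, y0 = dst_rect["xOff"], dst_rect["yOff"]
    x1, y1 = x0 + dst_rect["xSize"], y0 + dst_rect["ySize"]
    if x0 % chunk_cols or y0 % chunk_rows:
        # A source that starts mid-chunk cannot be referenced chunk by chunk.
        raise ValueError("source window is not aligned to the chunk grid")
    # -(-a // b) is ceiling division for positive integers.
    return (range(y0 // chunk_rows, -(-y1 // chunk_rows)),
            range(x0 // chunk_cols, -(-x1 // chunk_cols)))


def placeholder_manifest(sources, chunk_shape):
    """Map 'row.col' chunk keys to the file expected to hold that chunk."""
    manifest = {}
    for src in sources:  # later sources overwrite earlier ones on overlap
        rows, cols = chunk_index_ranges(src["dst_rect"], chunk_shape)
        for r in rows:
            for c in cols:
                manifest[f"{r}.{c}"] = {
                    "path": src["filename"],
                    "offset": None,  # unknown until the source file is scanned
                    "length": None,  # unknown until the source file is scanned
                }
    return manifest


# Two hypothetical, non-overlapping 512x512 tiles placed side by side.
sources = [
    {"filename": "tile_a.tif",
     "dst_rect": {"xOff": 0, "yOff": 0, "xSize": 512, "ySize": 512}},
    {"filename": "tile_b.tif",
     "dst_rect": {"xOff": 512, "yOff": 0, "xSize": 512, "ySize": 512}},
]

manifest = placeholder_manifest(sources, chunk_shape=(256, 256))
for key in sorted(manifest):
    print(key, manifest[key]["path"])
```

The sketch only works when source windows line up with the chunk grid, which is one reason a mosaic with no overlaps and a single projection is the easiest case to start from.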
Also we can definitely open Cloud optimized GeoTIFFS now (since #162).
Thanks to @scottyhq for mentioning this idea. Maybe he, @abarciauskas-bgse, or someone else who knows more about GDAL can say whether they think this idea might actually work or not.