weecology / retriever

Quickly download, clean up, and install public datasets into a database management system
http://data-retriever.org

Add Soil data macrosys #1547

Open henrykironde opened 3 years ago

henrykironde commented 3 years ago

Soil water content (volumetric %) for 33kPa and 1500kPa suctions predicted at 6 standard depths (0, 10, 30, 60, 100 and 200 cm) at 250 m resolution

source https://zenodo.org/record/2784001#.YDlJ02pKiBR or https://developers.google.com/earth-engine/datasets/catalog/OpenLandMap_SOL_SOL_WATERCONTENT-33KPA_USDA-4B1C_M_v01

citation: "Tomislav Hengl, & Surya Gupta. (2019). Soil water content (volumetric %) for 33kPa and 1500kPa suctions predicted at 6 standard depths (0, 10, 30, 60, 100 and 200 cm) at 250 m resolution (Version v0.1) [Data set]. Zenodo. http://doi.org/10.5281/zenodo.2784001"

License (for files): Creative Commons Attribution Share Alike 4.0 International

henrykironde commented 3 years ago

@MarconiS https://gitlab.com/openlandmap/global-layers

MarconiS commented 3 years ago

Here is the link to the Zenodo archive for all derived datasets of global soil properties (0.065 km² spatial resolution)

Aakash3101 commented 3 years ago

Have these datasets been added as scripts in retriever-recipes? If not, I would like to work on this issue.

henrykironde commented 3 years ago

@Aakash3101 feel free to work on the issue. I recommend that you start from the bottom of the list and work your way up.

Aakash3101 commented 3 years ago

Sure @henrykironde

Aakash3101 commented 3 years ago

@henrykironde I wanted to clear up a doubt: for the last dataset, "Soil available water capacity in mm derived for 5 standard layers", I can make a single script for all the files in the dataset, right? The dataset has 7 files, so when I run retriever autocreate I can have all the files in the same directory?

Aakash3101 commented 3 years ago

Also, should I make separate commits for each dataset, or one combined commit?

henrykironde commented 3 years ago

> I can make a single script for all the files in the dataset, right? The dataset has 7 files, so when I run retriever autocreate I can have all the files in the same directory?

Yes, all the files go in the same directory. In this case, I think a fitting name for the directory would be Soil_available_water_capacity.
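For reference, the directory-based workflow might look roughly like the sketch below. The exact autocreate flags are assumptions here, so check `retriever autocreate --help` for your version:

```bash
# Gather all 7 files of the dataset into one directory.
mkdir Soil_available_water_capacity
mv sol_available.water.capacity_*.tif Soil_available_water_capacity/

# Generate one script covering the whole directory. The -d flag
# (treat the path as a directory) is an assumption here; confirm
# the available options with `retriever autocreate --help`.
retriever autocreate Soil_available_water_capacity -d
```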

Aakash3101 commented 3 years ago

@henrykironde I think this PR could be completed during my GSoC project, if I get selected, because these files are indeed very big 😂. I might take some time to check each one and then make a PR for each dataset added.

henrykironde commented 3 years ago

Each checkbox is a single PR. I am actually working on them, so don't worry about the whole issue. Your goal should be to understand, or get a good overview of, the moving parts in the project.

Aakash3101 commented 3 years ago

> Each checkbox is a single PR. I am actually working on them, so don't worry about the whole issue. Your goal should be to understand, or get a good overview of, the moving parts in the project.

Yes, actually I am enjoying this kind of work, as I am learning new things.

Aakash3101 commented 3 years ago

@henrykironde I am not able to load the .tif files into PostgreSQL. There seems to be some size limit beyond which raster2pgsql stops working efficiently: it works completely fine with small files, but it just gets stuck when I run it on the big files, which are around 3 to 4 GB.
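For context, the failing invocation is roughly of this shape (the table and database names below are placeholders, not the actual ones used):

```bash
# Load a GeoTIFF into PostGIS, cut into 100x100-pixel tiles.
#   -s 4326     SRID of the raster (WGS 84, per the gdalinfo output below)
#   -t 100x100  tile size; each tile becomes one row in the table
#   -I          build a GiST spatial index on the raster column
#   -C          apply the standard raster constraints after loading
raster2pgsql -s 4326 -t 100x100 -I -C \
    sol_available.water.capacity_usda.mm_m_250m_30..60cm_1950..2017_v0.1.tif \
    public.soil_awc | psql -d retriever_db
```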

henrykironde commented 3 years ago

I will check this out

Aakash3101 commented 3 years ago

> I will check this out

Well, I am also figuring something out, and it turns out that the tile size can impact the processing time. In the code for the install command the tile size is 100x100; when I tried a tile size of 2000x2000, the file was saved to the database, but I cannot view it in QGIS. Both pgAdmin 4 and the DB Manager in QGIS show that the table does have raster values.
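One quick way to sanity-check what actually landed in the database is the PostGIS raster_columns view (database and table names are placeholders again):

```bash
# Show the tile size and SRID recorded for each raster table.
psql -d retriever_db -c \
  "SELECT r_table_name, r_raster_column, srid, blocksize_x, blocksize_y
   FROM raster_columns;"

# Count how many tile rows the load produced.
psql -d retriever_db -c "SELECT count(*) FROM public.soil_awc;"
```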

Aakash3101 commented 3 years ago

> I will check this out

Any updates @henrykironde? To me, it seems that when a tile size of 100x100 is used, a huge number of rows will be generated. For example, the size of this file is 172800x71698 pixels:

```
aakash01@aakash01-G3-3579:~/.retriever/raw_data/soil-available-water-capacity $ gdalinfo sol_available.water.capacity_usda.mm_m_250m_30..60cm_1950..2017_v0.1.tif
Driver: GTiff/GeoTIFF
Files: sol_available.water.capacity_usda.mm_m_250m_30..60cm_1950..2017_v0.1.tif
Size is 172800, 71698
Coordinate System is:
GEOGCRS["WGS 84",
    DATUM["World Geodetic System 1984",
        ELLIPSOID["WGS 84",6378137,298.257223563,
            LENGTHUNIT["metre",1]]],
    PRIMEM["Greenwich",0,
        ANGLEUNIT["degree",0.0174532925199433]],
    CS[ellipsoidal,2],
        AXIS["geodetic latitude (Lat)",north,
            ORDER[1],
            ANGLEUNIT["degree",0.0174532925199433]],
        AXIS["geodetic longitude (Lon)",east,
            ORDER[2],
            ANGLEUNIT["degree",0.0174532925199433]],
    ID["EPSG",4326]]
Data axis to CRS axis mapping: 2,1
Origin = (-180.000000000000000,87.370000000000005)
Pixel Size = (0.002083333000000,-0.002083333000000)
Metadata:
  AREA_OR_POINT=Area
Image Structure Metadata:
  COMPRESSION=DEFLATE
  INTERLEAVE=BAND
Corner Coordinates:
Upper Left  (-180.0000000,  87.3700000) (180d 0' 0.00"W, 87d22'12.00"N)
Lower Left  (-180.0000000, -62.0008094) (180d 0' 0.00"W, 62d 0' 2.91"S)
Upper Right ( 179.9999424,  87.3700000) (179d59'59.79"E, 87d22'12.00"N)
Lower Right ( 179.9999424, -62.0008094) (179d59'59.79"E, 62d 0' 2.91"S)
Center      (  -0.0000288,  12.6845953) (  0d 0' 0.10"W, 12d41' 4.54"N)
Band 1 Block=172800x1 Type=Int16, ColorInterp=Gray
  NoData Value=-32768
  Overviews: 86400x35849, 43200x17925, 21600x8963, 10800x4482, 5400x2241, 2700x1121, 1350x561
```

When I run the raster2pgsql command with a tile size of 100x100, it takes an indefinite amount of time to process, while with tile sizes of 2000x2000 or 5000x5000 it takes about 40 minutes to 1 hour. But when I try to view the raster through QGIS, it seems to add the layer to the canvas and then crashes after 10 minutes or so.
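The row counts explain the difference in load time; each tile becomes one row, so for a 172800x71698 raster:

```bash
# Tiles (= table rows) produced at each tile size:
#   100x100:   ceil(172800/100)  * ceil(71698/100)  = 1728 * 717 = 1,238,976 rows
#   2000x2000: ceil(172800/2000) * ceil(71698/2000) =   87 *  36 =     3,132 rows
#   5000x5000: ceil(172800/5000) * ceil(71698/5000) =   35 *  15 =       525 rows
```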

Another way to deal with this processing-time issue is to register the file with the -R flag of the raster2pgsql command; with this flag only a reference to the raster is stored in the database, not the raster data itself.

But this undercuts the reason we are storing it in the database in the first place, because if the file is moved from where it is expected to be, the reference breaks. I had the idea for the -R flag because the raw data downloaded when you first install the dataset does not get deleted, so referencing the data in place would save the user some storage on the system.
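A sketch of that out-of-db variant (placeholder table/database names again); note that raster2pgsql requires an absolute path to the file when -R is used, since the database reads the pixels from disk at query time:

```bash
# Register the raster out-of-db: only metadata and the file path are
# stored in PostGIS; the pixel data stays in the GeoTIFF on disk.
raster2pgsql -s 4326 -t 2000x2000 -I -C -R \
    "$HOME/.retriever/raw_data/soil-available-water-capacity/sol_available.water.capacity_usda.mm_m_250m_30..60cm_1950..2017_v0.1.tif" \
    public.soil_awc | psql -d retriever_db
```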

henrykironde commented 3 years ago

@Aakash3101 what are your computational resources?

Aakash3101 commented 3 years ago

> @Aakash3101 what are your computational resources?

CPU: i7 8th Gen, GPU: GeForce GTX 1050 Ti, RAM: 8 GB DDR4, GPU RAM: 4 GB, OS: Ubuntu 20.04 LTS

henrykironde commented 3 years ago

Could you try closing other applications (especially IDEs), then open QGIS and try to load the map? I will try it later from my end. Give it a few minutes to render.

Aakash3101 commented 3 years ago

I can load and view the map from the raw data file, but not from the PostGIS database.

henrykironde commented 3 years ago

Yes, load the data from the PostGIS database and give it at least 10 minutes, depending on your resources. Make sure to free at least 4 GB of memory; most IDEs take about 2 GB, so closing them will let QGIS load the data.

Aakash3101 commented 3 years ago

Okay, I will let you know if it opens.

Aakash3101 commented 3 years ago

So this time, while loading the file in QGIS, I monitored my RAM usage through the terminal: it uses all my memory, and then the application is terminated. I don't know the reason yet, but I will soon find out.
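For anyone reproducing this, the monitoring itself is just something like:

```bash
# Refresh overall memory usage every second while QGIS loads the layer.
watch -n 1 free -h
```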

Aakash3101 commented 3 years ago

And when I open the raw data file directly, it uses only around 2 GB of my RAM. I think the extra memory usage is caused by PostGIS running queries or something in the background.

Aakash3101 commented 3 years ago

When I query the table in pgAdmin 4 to show all the values in the table, Postgres uses all the RAM and then freezes, so I think I need to limit the memory available to queries. Please let me know if you find something useful for optimizing the memory usage.
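Two things that might help, sketched under assumptions (the table/database names are placeholders and the config values are illustrative, not tuned): never pull every tile back at once, and cap per-query memory in postgresql.conf:

```bash
# Inspect a handful of tiles instead of selecting the whole table;
# raster values are large, so always LIMIT or page raster queries.
psql -d retriever_db -c \
  "SELECT rid, ST_Width(rast), ST_Height(rast)
   FROM public.soil_awc LIMIT 10;"

# Illustrative postgresql.conf knobs for an 8 GB machine (assumed
# starting points, not tuned recommendations):
#   shared_buffers = 1GB
#   work_mem = 64MB
#   maintenance_work_mem = 256MB
```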

henrykironde commented 3 years ago

Okey I think at this point you should let me handle this. It could take at least one day or two. I will try to find a way around. This is at a good point/phase. I will update you. I need to finish up with some other spatial datasets first

Aakash3101 commented 3 years ago

Sure @henrykironde