wri / gfw_forest_loss_geotrellis

Global Tree Cover Loss Analysis using Geotrellis and Spark
MIT License

Re-use input feature RDD, only create spatial RDD once, to make use o… #93

Closed jterry64 closed 3 years ago

jterry64 commented 3 years ago

This runs fast now, and this seemed like the simplest solution. There were a few issues causing the slowness.

The main 8+ hour slowdown came from doing the spatial join for fires with the full-sized dissolved geometry alongside all the list geometries inside it. GeoSpark does quadtree partitioning via random sampling, assigns each partition a grid cell, and then checks whether each feature intersects that cell. If a large feature intersects multiple cells, it duplicates the geometry into each matching partition and internally drops the duplicate results at the end of the spatial join. Given that the dissolved geometry intersects every list geometry, I think the slowness was that it was duplicating this huge geometry across partitions, and then doing the full join with fire points in each of those partitions.
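For context, the join described above looks roughly like this in GeoSpark 1.x. This is a hedged sketch, not the repo's actual code: the RDD names are made up, and the exact `SpatialJoinQueryFlat` parameter order may differ between GeoSpark versions.

```scala
import org.datasyslab.geospark.enums.GridType
import org.datasyslab.geospark.spatialOperator.JoinQuery

// Quadtree partitioning samples one RDD to build grid cells, then copies
// any geometry that spans several cells into each matching partition.
// A dissolved geometry covering the whole area of interest therefore gets
// copied into (nearly) every partition before the join even starts.
firePointsRDD.analyze()
firePointsRDD.spatialPartitioning(GridType.QUADTREE)

// The polygon side must reuse the SAME partitioner as the point side.
featurePolygonsRDD.spatialPartitioning(firePointsRDD.getPartitioner)

// Duplicate pairs produced by the copied geometries are deduplicated
// internally at the end of the join (parameter order is approximate).
val joined = JoinQuery.SpatialJoinQueryFlat(
  firePointsRDD, featurePolygonsRDD, /* useIndex = */ false,
  /* considerBoundaryIntersection = */ true)
```

The key point is that the cost of the copy-then-dedup strategy scales with how many grid cells a single geometry touches, which is worst-case for a dissolved geometry.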

The solution here was to make sure the dissolved geometry is split. We could do that either by re-using the enriched GADM intersection RDD from the first analysis, or just using the split_geometries flag.
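Either option amounts to clipping the dissolved geometry against a grid so no single piece spans many partitions. A hypothetical sketch of that step using the JTS API (the helper name, cell size, and the `org.locationtech` package are assumptions; older GeoSpark builds use `com.vividsolutions.jts` instead):

```scala
import org.locationtech.jts.geom.{Envelope, Geometry, GeometryFactory}

// Hypothetical illustration of what a split_geometries step does:
// intersect one large polygon with regular grid cells and keep the
// non-empty pieces, each of which stays small and partition-friendly.
def splitByGrid(geom: Geometry, cellSize: Double): Seq[Geometry] = {
  val gf  = new GeometryFactory()
  val env = geom.getEnvelopeInternal
  val pieces = for {
    x <- BigDecimal(env.getMinX) until env.getMaxX by cellSize
    y <- BigDecimal(env.getMinY) until env.getMaxY by cellSize
  } yield {
    val cell = gf.toGeometry(
      new Envelope(x.toDouble, x.toDouble + cellSize,
                   y.toDouble, y.toDouble + cellSize))
    geom.intersection(cell)
  }
  pieces.filterNot(_.isEmpty)
}
```

Each piece carries the parent's feature ID, so results can be re-aggregated after the join.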

Another slowdown I noticed: when we create a spatial RDD from a feature RDD via rawSpatialRDD, the analyze() method can take almost an hour. Honestly I'm not sure why; all this method does is merge the bounding boxes of the features, so there might be an issue in GeoSpark. But when we read the geometry as a Spatial DataFrame and then convert it to a Spatial RDD, it already has this metadata, so analyze() is much faster (for splitGeometries, analyze() only takes a minute).
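The two construction paths contrasted above look roughly like this. A hedged sketch assuming GeoSpark 1.x: the DataFrame, column names, and `featureRdd` are invented for illustration, and the timing comments just restate the observations from this PR, not benchmarks of this code.

```scala
import org.datasyslab.geospark.spatialRDD.SpatialRDD
import org.datasyslab.geosparksql.utils.Adapter
import org.locationtech.jts.geom.Geometry

// Slow path: wrap an existing RDD of geometries, then recompute metadata.
// analyze() has to scan every record to build the boundary envelope.
val slowRdd = new SpatialRDD[Geometry]()
slowRdd.rawSpatialRDD = featureRdd.map(_.geom).toJavaRDD()
slowRdd.analyze() // observed to take almost an hour on the full dataset

// Faster path: go through a Spatial DataFrame, which already carries
// the envelope metadata, then convert with the GeoSpark SQL Adapter.
val spatialDf = spark.sql(
  "SELECT ST_GeomFromWKB(geom_wkb) AS geom, feature_id FROM features")
val fastRdd = Adapter.toSpatialRdd(spatialDf, "geom")
fastRdd.analyze() // reported to finish in about a minute
```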

Consequently, the simplest and fastest approach, I think, is to just ALWAYS split_geometries for this analysis; that resolves all the issues GeoSpark has with massive geometries. Let me know if you think this might cause issues with the results, but as far as I can tell, we always merge back by feature ID anyway.
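The "merge back by feature ID" step is what makes always-splitting safe: per-piece results are summed back into per-feature results, so the totals should be unchanged. A minimal Spark SQL sketch with hypothetical column names:

```scala
import org.apache.spark.sql.functions.sum

// Sketch (column names are assumptions): split pieces keep their parent's
// feature ID, so summing per-piece counts reproduces per-feature totals,
// and splitting should not change the final results.
val perFeature = perPieceResults
  .groupBy("feature_id")
  .agg(sum("fire_alert_count").as("fire_alert_count"))
```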

Pull request checklist

Please check if your PR fulfills the following requirements:

Pull request type

Please check the type of change your PR introduces:

What is the current behavior?

Large dissolved geometries cause incredibly slow dashboard analysis runs for lists that should only take 20-30 minutes.

Issue Number: N/A

What is the new behavior?

Does this introduce a breaking change?