This now runs fast, and this seemed like the simplest solution. There were a few issues causing the slowness.
The main 8+ hour slowness came from doing the spatial join for fires against the full-sized dissolved geometry alongside all the list geometries inside it. GeoSpark does quadtree partitioning via random sampling, assigns each partition a grid cell, and then checks whether each feature intersects that cell. If a large feature intersects multiple cells, GeoSpark duplicates the geometry into each of those partitions and internally drops the duplicate results at the end of the spatial join. So, given that the dissolved geometry intersects every list geometry, I think the slowness was that it duplicates this huge geometry across partitions and then does the full join with fire points in each of those partitions.
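For illustration, the problematic join pattern looks roughly like this (a sketch, not our pipeline code: the RDD names are placeholders, and the imports assume GeoSpark 1.x package names):

```scala
import com.vividsolutions.jts.geom.Geometry // org.locationtech.jts in newer GeoSpark/Sedona
import org.datasyslab.geospark.enums.GridType
import org.datasyslab.geospark.spatialOperator.JoinQuery
import org.datasyslab.geospark.spatialRDD.SpatialRDD

// Placeholders for our real inputs: fire alert points on one side, the
// list geometries plus the huge dissolved geometry on the other.
val firePointsRDD: SpatialRDD[Geometry] = ???
val featureRDD: SpatialRDD[Geometry] = ???

// Quadtree partitioning samples one RDD to build the grid, then both RDDs
// are repartitioned against it. A feature that spans many grid cells (and
// the dissolved geometry overlaps essentially all of them) is copied into
// every partition it touches.
firePointsRDD.analyze()
firePointsRDD.spatialPartitioning(GridType.QUADTREE)
featureRDD.spatialPartitioning(firePointsRDD.getPartitioner)

// The join then runs per partition against the duplicated copies, and
// GeoSpark deduplicates the matched pairs at the end.
val firesByFeature = JoinQuery.SpatialJoinQueryFlat(
  firePointsRDD, // objects (points)
  featureRDD,    // query windows (polygons)
  false,         // useIndex
  true)          // considerBoundaryIntersection
```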
The solution here was to make sure the dissolved geometry is split. We could do that either by re-using the enriched GADM intersection RDD from the first analysis, or just by using the split_geometries flag.
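For intuition, "splitting" here just means clipping a feature against a regular grid and keeping the feature ID on every piece so results can be merged back by ID afterwards. A minimal JTS sketch of that idea (a hypothetical helper, not our actual splitGeometries implementation; the cell size is an assumption):

```scala
import com.vividsolutions.jts.geom.{Envelope, Geometry, GeometryFactory}

// Clip `geom` against a regular grid of cellSize x cellSize cells so that
// no single piece spans the whole extent; every piece keeps the feature ID.
def splitByGrid(id: String, geom: Geometry, cellSize: Double): Seq[(String, Geometry)] = {
  val gf  = new GeometryFactory()
  val env = geom.getEnvelopeInternal
  val (minX, maxX) = (math.floor(env.getMinX / cellSize).toInt, math.floor(env.getMaxX / cellSize).toInt)
  val (minY, maxY) = (math.floor(env.getMinY / cellSize).toInt, math.floor(env.getMaxY / cellSize).toInt)
  for {
    i <- minX to maxX
    j <- minY to maxY
    cell    = gf.toGeometry(new Envelope(i * cellSize, (i + 1) * cellSize,
                                         j * cellSize, (j + 1) * cellSize))
    clipped = geom.intersection(cell)
    if !clipped.isEmpty
  } yield (id, clipped)
}
```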
Another source of slowness I noticed: when we create a spatial RDD from a feature RDD via rawSpatialRDD, the analyze() method can take almost an hour. I'm honestly not sure why, since all this method does is merge the bounding boxes of the features; there might be an issue in GeoSpark. But when we read the geometry as a Spatial DF and then convert it to a Spatial RDD, it seems to already have this metadata, so analyze() is much faster (the splitGeometries analyze() only takes about a minute).
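For reference, the faster path looks something like this (a sketch assuming GeoSpark SQL's Adapter; the input path and column names are placeholders):

```scala
import org.apache.spark.sql.SparkSession
import org.datasyslab.geosparksql.utils.{Adapter, GeoSparkSQLRegistrator}

val spark = SparkSession.builder().appName("dashboard-analysis").getOrCreate()
GeoSparkSQLRegistrator.registerAll(spark)

// Hypothetical input: a TSV of features with a WKT geometry column.
val featureDF = spark.read
  .option("header", "true")
  .option("delimiter", "\t")
  .csv("s3://some-bucket/features.tsv")
  .selectExpr("feature_id", "ST_GeomFromWKT(geom_wkt) AS geom")

// Going through the Spatial DF keeps the geometry metadata around, so the
// analyze() call afterwards is cheap, instead of spending ~1 hour merging
// bounding boxes when the SpatialRDD is built from rawSpatialRDD directly.
val spatialFeatureRDD = Adapter.toSpatialRdd(featureDF, "geom")
spatialFeatureRDD.analyze()
```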
Consequently, I think the simplest and fastest approach is to just ALWAYS use split_geometries for this analysis, which resolves all the issues GeoSpark has with massive geometries. Let me know if you think this might cause issues with the results, but as far as I can tell, we always merge back by feature ID anyway.
Pull request checklist
Please check if your PR fulfills the following requirements:
[X] Make sure you are requesting to pull a topic/feature/bugfix branch (right side). Don't request your master!
[X] Make sure you are making a pull request against the develop branch (left side). Also you should start your branch off our develop.
[X] Check that the commit's (or all commits') message style matches our requested structure.
[X] Check your code additions will fail neither code linting checks nor unit tests.
Pull request type
Please check the type of change your PR introduces:
[X] Bugfix
[ ] Feature
[ ] Code style update (formatting, renaming)
[ ] Refactoring (no functional changes, no api changes)
[ ] Build related changes
[ ] Documentation content changes
[ ] Other (please describe):
What is the current behavior?
Large dissolved geometries cause incredibly slow (8+ hour) dashboard analysis runs for lists that should only take 20-30 minutes.
Issue Number: N/A
What is the new behavior?
Re-use input featureRDD for all dashboard analyses instead of creating a new DF
Only create spatialFeatureRDD once, since the analyze() method can be quite slow
NOTE: now you should always use the --split_geometries flag for dashboard analysis, especially if including the dissolved geometry
Does this introduce a breaking change?