Closed danscales closed 9 months ago
I think the collects could potentially have some bad performance impact unfortunately.
Also seems like a fairly complicated method to check this - we're the ones who filter out geometries that don't intersect TCL in GridRDD, and this method has us then later checking which ones we filtered out with a diff. Could we just modify GridRDD to return which features are intersecting and non-intersecting instead of filtering them out? Then FCD analysis can set those with your error.
Thanks for the review!
We may want to talk individually or in the huddle. We actually get the features that intersect the TCL extent by using sedona's JoinQuery.SpatialJoinQueryFlat(). It returns pairs of (1x1 grid box, feature), showing which 1x1 grid boxes intersect which features. But it doesn't return the features that don't intersect any grid boxes. So, that is why I used the approach I did with the hash table. Note also that the information that is being passed back to the driver is just the (list id, location id), so even for inputs with 10,000 locations, you might only be sending 80K data back to the driver. But I agree that the distinct()
call could be expensive.
This also feels like perhaps something better to filter out before submitting to geotrellis. I know Pro already has a lambda that clips features to land, could that also just clip features to the TCL extent? Then they can tell users their geometry is invalid before they submit the list to geotrellis in case they want to edit it rather than waiting an hour to see if fail.
Yes, I will check with Kinshuk and Andy, see if they can be can clip to the TCL extent, rather than just to land. Thanks for the info!
We may want to talk individually or in the huddle. We actually get the features that intersect the TCL extent by using sedona's JoinQuery.SpatialJoinQueryFlat(). It returns pairs of (1x1 grid box, feature), showing which 1x1 grid boxes intersect which features. But it doesn't return the features that don't intersect any grid boxes. So, that is why I used the approach I did with the hash table. Note also that the information that is being passed back to the driver is just the (list id, location id), so even for inputs with 10,000 locations, you might only be sending 80K data back to the driver. But I agree that the
distinct()
call could be expensive.
Ah sorry, I think was misinterpreting how GridRDD worked. If the distinct() call doesn't cause performance issues we could go with this. It does still feel a little clunky to me to filter out geometries and then backwards determine what we filtered out, rather just keeping track of that while we filter. It doesn't seem like Sedona supports an outer join. There is a parameter clip
you could set to false
to not clip to TCL extent during the join in that analysis. Then you can propagate the NoIntersectionError later, or just add a check afterwards for the joined RDD against the TCL extent geometry (just map the joined RDD geometries and do an intersection on each). Not sure if that'd be more performant though.
All modified and coverable lines are covered by tests :white_check_mark:
Comparison is base (
e092462
) 18.15% compared to head (c6271c4
) 18.17%.
:exclamation: Your organization needs to install the Codecov GitHub app to enable full functionality.
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
GTC-2651 Add error lines for any locations that don't overlap TCL extent
We separate out the locations that don't intersect the extent of the Tree Cover Loss dataset at all, and give those an entry in the final output with an error. GFWPro folks said that would be useful to flag invalid locations. We also deal with the case where no locations intersect the TCL extent (i.e. the set of valid locations is empty).
This does involve collecting intersecting location ids back to the driver node near the beginning of the analysis, but I don't think that should be significant traffic compared to the rest of the analysis. But let me know if you think this is too expensive, or if you have suggestion on a better way to do this in Geotrellis/Spark.
Added comments in various places as I understood the code.