wri / gfw_forest_loss_geotrellis

Global Tree Cover Loss Analysis using Geotrellis and SPARK
MIT License

Propagate errors to output in ForestChangeDiagnostic #105

Closed echeipesh closed 3 years ago

echeipesh commented 3 years ago

Pull request checklist

Please check if your PR fulfills the following requirements:

Feature: Adding per-location error column to forest change diagnostic

Please check the type of change your PR introduces:

What is the current behavior?

Currently, if a location's geometry does not intersect any raster pixel center (for example, because the geometry is too small), its row is dropped from the output. This error is subtle: the geometry can intersect the raster itself, but because it does not intersect a pixel center, the polygonal summary returns an empty result. Using only pixel center intersections is critical to avoid double counting forest loss across multiple geometries.
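To make the pixel center rule concrete, here is a minimal sketch in plain Scala. It does not use GeoTrellis. The `Rect` and `Grid` types, the half-open containment test, and the grid layout are all assumptions made for illustration, not the project's implementation.

```scala
// Illustrative sketch only: Rect, Grid and pixelsByCenter are hypothetical names,
// not part of GeoTrellis or gfw_forest_loss_geotrellis.
object PixelCenterSketch {
  // Axis-aligned rectangle standing in for an arbitrary location geometry.
  // Containment is half-open so adjacent rectangles never share a point.
  final case class Rect(xmin: Double, ymin: Double, xmax: Double, ymax: Double) {
    def contains(x: Double, y: Double): Boolean =
      x >= xmin && x < xmax && y >= ymin && y < ymax
  }

  // Raster of cols x rows square pixels with its origin at (0, 0).
  final case class Grid(cols: Int, rows: Int, cellSize: Double) {
    // A pixel is counted only if its center point lies inside the geometry.
    def pixelsByCenter(geom: Rect): Seq[(Int, Int)] =
      for {
        col <- 0 until cols
        row <- 0 until rows
        cx = (col + 0.5) * cellSize
        cy = (row + 0.5) * cellSize
        if geom.contains(cx, cy)
      } yield (col, row)
  }

  def main(args: Array[String]): Unit = {
    val grid = Grid(cols = 4, rows = 4, cellSize = 1.0)
    val tiny = Rect(0.1, 0.1, 0.3, 0.3)  // overlaps pixel (0, 0) but misses its center
    val large = Rect(0.0, 0.0, 2.0, 2.0) // covers four pixel centers
    println(grid.pixelsByCenter(tiny))
    println(grid.pixelsByCenter(large))
  }
}
```

In this sketch the tiny rectangle overlaps the raster yet selects no pixels, which is the situation that previously caused a row to disappear, while the large rectangle selects four.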

The same happens if there is an error reading raster tiles. Although the error traces are present in the logs, the missing records in the output are problematic.

What is the new behavior?

All locations present in the input will be present in the output. If an error occurs while processing a location, it is reported in a new "location_error" column, which is a JSON array of strings. Additionally, a new "status_code" column contains "3" in case of error and "2" in case of no error. These status codes match the conventions on the requester and consumer side.
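The intended row shape can be sketched roughly as follows, in plain Scala without Spark. Only the column names "location_error" and "status_code" and the codes 2 and 3 come from the description above. The `OutputRow` type, the `location_id` field, the `Either` result, the sample error message, and the choice of an empty array for the no-error case are assumptions made for illustration.

```scala
// Illustrative sketch only: OutputRow, toJsonArray and toRow are hypothetical names.
object LocationErrorSketch {
  // Hypothetical output row; snake case field names mirror the output columns.
  final case class OutputRow(location_id: Int, status_code: Int, location_error: String)

  // Render messages as a JSON array of strings.
  // Simplification: only backslashes and double quotes are escaped.
  def toJsonArray(errors: Seq[String]): String =
    errors
      .map(e => "\"" + e.replace("\\", "\\\\").replace("\"", "\\\"") + "\"")
      .mkString("[", ",", "]")

  // Every input location produces a row: status 2 on success, 3 on error.
  // Assumption: a successful row carries an empty JSON array.
  def toRow(locationId: Int, result: Either[Seq[String], Double]): OutputRow =
    result match {
      case Right(_)     => OutputRow(locationId, 2, toJsonArray(Seq.empty))
      case Left(errors) => OutputRow(locationId, 3, toJsonArray(errors))
    }

  def main(args: Array[String]): Unit = {
    println(toRow(1, Right(12.5)))
    println(toRow(2, Left(Seq("geometry did not intersect any pixel center"))))
  }
}
```

The point of the sketch is that a failure becomes data in the row instead of a missing row.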

Does this introduce a breaking change?

The effect is to add new columns without changing the names or selection of existing columns. This change should not be breaking.

Other information

This PR changes the way the intermediate and output DataFrames are written. The previous code relied on index-based mapping on both the write side and the read side. This kind of scheme is brittle, and I experienced the effects of that while trying to introduce the error columns.

The Spark DataFrame API already translates the names of case class fields to columns. This PR renames the fields of ForestChangeDiagnosticData to match the snake case of the expected output columns. This breaks Scala convention a little but adds a lot of resiliency and leaves less code to maintain. In a related change, ForestChangeDiagnosticSimpleRow and ForestChangeDiagnosticGridRow are no longer required: we can convert ForestChangeDiagnosticData directly and then select the columns we want in the output.
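The brittleness of index-based mapping can be shown with a small sketch in plain Scala. It does not use Spark, and the column name `tree_cover_loss`, the sample values, and both helper functions are hypothetical.

```scala
// Illustrative sketch only: all names and values here are hypothetical.
object ColumnMappingSketch {
  // Index-based read: assumes the loss value always sits at position 1.
  def lossByIndex(row: Seq[String]): String = row(1)

  // Name-based read: pairs each value with its column name before looking it up.
  def lossByName(header: Seq[String], row: Seq[String]): String =
    header.zip(row).toMap.apply("tree_cover_loss")

  def main(args: Array[String]): Unit = {
    val oldHeader = Seq("location_id", "tree_cover_loss")
    val oldRow = Seq("7", "12.5")

    // Two new columns inserted before the loss value shift every later index.
    val newHeader = Seq("location_id", "status_code", "location_error", "tree_cover_loss")
    val newRow = Seq("7", "2", "[]", "12.5")

    println(lossByIndex(oldRow))           // the loss value
    println(lossByIndex(newRow))           // silently reads status_code instead
    println(lossByName(oldHeader, oldRow)) // the loss value
    println(lossByName(newHeader, newRow)) // still the loss value
  }
}
```

Reading by name survives the added columns, which is the property gained when case class field names double as column names.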

frameless is a new dependency. It is used to derive the Spark Encoders for ForestChangeDiagnosticData and to map ForestChangeDiagnosticDataLossYearly and others to their JSON representation while generating the DataFrame.

Functions that translate from one case class to another were moved from the Analysis file to methods on the case classes where they are most appropriate.

echeipesh commented 3 years ago

Compared the output to the develop branch and verified that it is fine. Note that the tree_cover_loss_intact_forest_yearly output will differ, but that is because develop was wrong.