Propagate errors to output in ForestChangeDiagnostic

Pull request checklist

Please check if your PR fulfills the following requirements:

[x] Make sure you are requesting to pull a topic/feature/bugfix branch (right side). Don't request your master!
[x] Make sure you are making a pull request against the develop branch (left side). Also you should start your branch off our develop.
[x] Check that the commit's, or even all commits', message style matches our requested structure.
[ ] Check that your code additions will fail neither code linting checks nor unit tests.

Feature: Adding per-location error column to forest change diagnostic

Please check the type of change your PR introduces:

[ ] Bugfix
[x] Feature
[ ] Code style update (formatting, renaming)
[x] Refactoring (no functional changes, no api changes)
[ ] Build related changes
[ ] Documentation content changes
[ ] Other (please describe):

What is the current behavior?

[Screenshot, 2021-08-27: feature geometry overlaid on the forest loss pixel grid]

green: feature geometry
yellow: forest loss pixel grid (30m resolution)

Currently, if the geometry of a location does not intersect any raster pixel center (for example, because the geometry is too small), the row is dropped from the output. This error is subtle: the geometry can intersect the raster itself, but because it does not intersect a pixel center, the polygonal summary returns an empty result. Using only pixel center intersections is critical to avoid double counting forest loss across multiple geometries.
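To make the pixel center rule concrete, here is a minimal sketch in plain Scala. It is not the project's code: `Box`, `PixelCenters`, and `centersInside` are hypothetical names, the geometry is simplified to an axis-aligned box, and the grid is assumed to be anchored at the origin.

```scala
// Hypothetical stand-ins: an axis-aligned box for the feature geometry and a
// square grid anchored at the origin for the raster.
final case class Box(xmin: Double, ymin: Double, xmax: Double, ymax: Double)

object PixelCenters {
  /** Number of pixel centers inside `geom` for a grid of `cell`-sized pixels. */
  def centersInside(geom: Box, cell: Double): Int = {
    // Pixel i has its center at (i + 0.5) * cell; keep the indices whose
    // center lies between lo and hi. The range is empty when first > last.
    def indices(lo: Double, hi: Double): Range = {
      val first = math.ceil(lo / cell - 0.5).toInt
      val last = math.floor(hi / cell - 0.5).toInt
      first to last
    }
    indices(geom.xmin, geom.xmax).size * indices(geom.ymin, geom.ymax).size
  }
}
```

With 30 m pixels, `Box(2.0, 2.0, 10.0, 10.0)` overlaps the first pixel but misses its center at (15, 15), so it counts zero centers, while `Box(10.0, 10.0, 20.0, 20.0)` covers that center and counts one.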

The same happens if there is an error reading raster tiles. The error traces are present in the logs, but the missing records in the output are problematic.

What is the new behavior?

All locations present in the input will be present in the output. If an error occurs while processing a location, it is reported in a new "location_error" column, which holds a JSON array of strings. Additionally, a new "status_code" column contains "3" in case of error and "2" otherwise. These status codes match conventions on the requester and consumer side.
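The rule above can be sketched without Spark as follows. This is an illustration only: `OutputRow`, `ErrorPropagation`, and `toRow` are hypothetical names, the status code is modelled as an `Int`, and writing an empty array for successful rows is an assumption (the PR does not say what "location_error" holds when there is no error).

```scala
// Hypothetical row type; only `location_error` and `status_code` are named in this PR.
final case class OutputRow(location_id: Int, location_error: String, status_code: Int)

object ErrorPropagation {
  val StatusOk = 2
  val StatusError = 3

  /** Encode strings as a JSON array, escaping backslashes and double quotes only. */
  def jsonArray(items: Seq[String]): String =
    items
      .map(s => "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"")
      .mkString("[", ",", "]")

  /** Every input location yields exactly one row, whether processing failed or not. */
  def toRow(locationId: Int, result: Either[Seq[String], Double]): OutputRow =
    result match {
      case Left(errors) => OutputRow(locationId, jsonArray(errors), StatusError)
      // Assumption: a successful row carries an empty JSON array.
      case Right(_)     => OutputRow(locationId, jsonArray(Nil), StatusOk)
    }
}
```

Modelling the per-location result as an `Either` keeps the failure next to the location it belongs to, so no row has to be dropped to signal an error.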

Does this introduce a breaking change?

[ ] Yes
[x] No

The effect is to add new columns without changing the names or selection of existing columns. This change should not be breaking.

Other information

This PR changes the way the intermediate and output DataFrames are written. The previous code relied on index-based mapping on both the write side and the read side. This kind of scheme is brittle, and I ran into that brittleness while trying to introduce the error columns.
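A small sketch of why index-based mapping breaks when a column is inserted. The column layouts and the `ColumnMapping` helpers are hypothetical and use plain Scala collections rather than DataFrames.

```scala
object ColumnMapping {
  // Hypothetical layouts before and after inserting the new error column.
  val oldColumns: Seq[String] = Seq("location_id", "status_code")
  val newColumns: Seq[String] = Seq("location_id", "location_error", "status_code")

  // One row written with the new layout.
  val newRow: Seq[String] = Seq("42", "[]", "2")

  /** Index-based read: the position was hard-coded against the old layout. */
  def statusByIndex(row: Seq[String]): String = row(1)

  /** Name-based read: the position is looked up from the header at read time. */
  def statusByName(columns: Seq[String], row: Seq[String]): String =
    row(columns.indexOf("status_code"))
}
```

Here `statusByIndex(newRow)` silently returns the error column's value, while `statusByName(newColumns, newRow)` still returns the status code.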

The Spark DataFrame API already translates the names of case class fields to column names. This PR renames the fields of ForestChangeDiagnosticData to match the snake case of the expected output columns. This breaks Scala convention a little but adds a lot of resiliency and leaves less code to maintain. As a related change, ForestChangeDiagnosticSimpleRow and ForestChangeDiagnosticGridRow are no longer required: we can convert ForestChangeDiagnosticData directly and then select the columns we want in the output.
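The idea of letting field names drive column names can be imitated without Spark. The sketch below is a stand-in, not how Spark's Encoders work internally: it uses `productElementNames`, which requires Scala 2.13 or later, and `DiagnosticData`, its `tree_cover_loss_total` field, and `NamedColumns` are hypothetical.

```scala
// Hypothetical, trimmed-down stand-in for ForestChangeDiagnosticData,
// with snake case field names that double as column names.
final case class DiagnosticData(location_id: Int, tree_cover_loss_total: Double, status_code: Int)

object NamedColumns {
  /** Pair each field name with its value (needs Scala 2.13+). */
  def toColumns(p: Product): Seq[(String, Any)] =
    p.productElementNames.zip(p.productIterator).toSeq

  /** Keep only the requested columns, in the requested order. */
  def select(p: Product, names: Seq[String]): Seq[(String, Any)] = {
    val byName = toColumns(p).toMap
    names.map(n => n -> byName(n))
  }
}
```

Because selection goes by name, adding a field to the case class does not disturb readers that ask for existing columns.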

frameless is a new dependency. It is used to derive the Spark Encoders for ForestChangeDiagnosticData and to map ForestChangeDiagnosticDataLossYearly and others to their JSON representation while generating the DataFrame.

Functions that translate from one case class to another have moved from the Analysis file to methods on the case classes where they are most appropriate.

wri / gfw_forest_loss_geotrellis