This is still a work-in-progress, but I'm having trouble finding time off the clock to push it further. It compiles, but is still hitting some runtime errors in the layer reader (and probably elsewhere that I haven't seen yet). The strategy is to keep forcing computation with either `count` or `show` on the various intermediate products, and to debug each stage as needed. I mainly wanted to get something in front of you all to pick at.
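Concretely, the kind of forcing I mean looks something like the following sketch (the stage and column names are placeholders, not the actual intermediate products in this PR):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.sum

// Placeholder names throughout; the point is just that calling an action
// (count or show) on each intermediate product forces its lazy plan to run,
// so a failure surfaces at the stage that produced it.
object DebugStages {
  def inspect(features: DataFrame, layers: DataFrame): Unit = {
    val joined = features.join(layers, Seq("location_id"), "left")
    println(s"joined rows: ${joined.count()}") // forces evaluation of the join
    joined.show(5, truncate = false)           // peek at a few rows

    val summarized = joined.groupBy("gadm_id").agg(sum("area").as("total_area"))
    println(s"summary rows: ${summarized.count()}")
    summarized.show(5)
  }
}
```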
The current implementation of the AFi stats processor, much like most of the stats currently in use, is built on a fairly convoluted foundation of classes rooted in Spark's RDD API. While the RDD API still underpins modern Spark, it demands much more hand optimization than the modern Spark SQL machinery (which includes the optimizing Catalyst engine).
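To make the contrast concrete, here is a minimal, illustrative comparison (not code from this repo) of the same aggregation written against the RDD API and against the DataFrame API; only the latter passes through Catalyst's query planner.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

// Illustrative contrast only: the RDD version runs exactly as hand-written,
// while the DataFrame version is planned and optimized by Catalyst.
object RddVsSql {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("rdd-vs-sql").getOrCreate()
    import spark.implicits._

    val df = Seq(("BRA", 1.0), ("BRA", 2.0), ("IDN", 3.0)).toDF("iso", "area")

    // RDD style: manual keying and reduction, opaque to the optimizer.
    val totalsRdd = df.rdd
      .map(r => (r.getString(0), r.getDouble(1)))
      .reduceByKey(_ + _)
    totalsRdd.collect().foreach(println)

    // Spark SQL style: declarative, optimized by Catalyst.
    df.groupBy("iso").agg(sum("area").as("total_area")).show()

    spark.stop()
  }
}
```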
This PR offers an alternative implementation of the AFi summary statistic command. It is a nearly wholesale rewrite of the stats infrastructure: it refactors the layer infrastructure and steps away from the byzantine class hierarchy that currently defines the shape of the stats processors. It isn't letter-perfect, but it shows how we can move from the hard-to-decipher, hard-to-maintain code structure we have now toward a more SQL-like working pattern that may be more familiar to data scientists.
The patterns adopted in this example reimplementation have the goals of (1) making the data flow as explicit and understandable as possible, (2) threading error handling through the entire pipeline, (3) providing a viable template for other processing tasks, and (4) doing the above with modern Scala usage.
The focal point here is `AFiAnalysis.scala`. This module should be read as a chain of processing stages. Many of these stages will find application in other stats processors, requiring only that the relevant sections be extracted, lightly generalized, and organized into modules.
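As a rough illustration of the intended shape (the stage names, columns, and sources below are placeholders, not the actual contents of `AFiAnalysis.scala`), each stage is just a `DataFrame => DataFrame` transformation, and the analysis is their composition:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.sum

// Placeholder sketch of a stats processor expressed as a chain of stages.
object PipelineSketch {
  // Stage 1: read the input features (source and schema are hypothetical).
  def readFeatures(spark: SparkSession, uri: String): DataFrame =
    spark.read.option("header", "true").csv(uri)

  // Stage 2: join the features against a prepared layer table.
  def joinLayers(layers: DataFrame)(features: DataFrame): DataFrame =
    features.join(layers, Seq("location_id"), "left")

  // Stage 3: aggregate to the reporting unit.
  def summarize(df: DataFrame): DataFrame =
    df.groupBy("gadm_id").agg(sum("area").as("total_area"))

  // The whole analysis is just the composition of the stages.
  def run(spark: SparkSession, featureUri: String, layers: DataFrame): DataFrame =
    readFeatures(spark, featureUri)
      .transform(joinLayers(layers))
      .transform(summarize)
}
```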
The niceness of this implementation is somewhat marred by the requirement to keep careful track of errors, but I'm introducing a `Maybe` abstraction to help manage that complexity. Any operation that can potentially fail can be wrapped in a `Maybe`, which provides a three-valued logic of sorts: we can have (1) failed computations which encapsulate an error message, (2) successful computations encapsulating a value, or (3) non-erroring results carrying no value. The third alternative should be used sparingly. Once a column is computed with `Maybe` values, it can be unpacked, with its errors merged into a pre-existing error column. That error column defines the validity of each row for use in further calculations, and we can use the `whenValid` operator to define derived column values only for non-erring rows (to avoid null values). The existence of null values does lead to some trouble, but note that Spark will often interpret nullable columns as `Option`-valued as a convenience. The goal should be to keep `Maybe`-valued columns handled entirely within a given processing stage, to maximize the portability of these computations.
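For readers new to the idea, here is a minimal sketch of what such an abstraction could look like; the encoding, the names, and the array-of-errors convention are illustrative assumptions on my part, not the exact definitions in this PR.

```scala
import org.apache.spark.sql.{Column, functions => F}

// Illustrative sketch only: a three-valued result type plus a helper that
// gates a derived column on an (assumed) array-typed error column.
sealed trait Maybe[+A]
case class Err(message: String) extends Maybe[Nothing] // failed, carries an error message
case class Value[A](value: A) extends Maybe[A]          // succeeded, carries a value
case object NoValue extends Maybe[Nothing]              // succeeded, but has nothing to report

object Maybe {
  // Wrap a computation that may throw, capturing the failure as a message.
  def attempt[A](f: => A): Maybe[A] =
    try Value(f) catch { case e: Exception => Err(String.valueOf(e.getMessage)) }
}

object MaybeColumns {
  // Evaluate `expr` only for rows whose error column is empty; other rows get null.
  def whenValid(errors: Column)(expr: Column): Column =
    F.when(F.size(errors) === 0, expr)
}
```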
This is an initial foray into improving the clarity of how the stats are derived, and it should be seen as a model, not a final product. Further benchmarking work will be needed to confirm that this is a viable implementation, and more effort beyond that will be needed to validate the results of these computations against what exists today.