zhangyuc / splash

Splash Project for parallel stochastic learning
94 stars 21 forks

Splash consistency with Spark's RDD guarantees #7

Open fvlankvelt opened 7 years ago

fvlankvelt commented 7 years ago

Reviewing Splash's code, I notice quite a number of places where a workset is modified inside an RDD#foreach or RDD#map operation. This of course works fine as long as every change is kept in memory and Spark retains all of that memory.
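To make the pattern concrete, here is a minimal sketch of that kind of in-place update, with hypothetical names (`Weight`, `workset`) standing in for Splash's actual types. As long as the array stays in memory, the mutations are visible to later passes:

```scala
// Illustrative only: per-record state mutated in place during an
// iteration, the way a stochastic-gradient step updates shared state.
final case class Weight(var value: Double)

object WorksetSketch {
  def main(args: Array[String]): Unit = {
    val workset: Array[Weight] = Array.fill(4)(Weight(0.0))

    // A foreach-style pass that updates records in place. This is
    // correct only while this exact array instance remains in memory.
    workset.foreach { w => w.value += 0.1 }

    println(workset.map(_.value).mkString(", "))
  }
}
```

In plain Scala the array is owned by the caller, so this is safe; the question below is what happens when the iterated collection is an RDD partition whose lifetime Spark controls.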

However, AFAICS this is a poor match for Spark's fault tolerance guarantees. For example, Spark assumes that a #foreach operation performs no mutations, so it is free to discard an in-memory copy of a partition that can be recomputed or reloaded from disk, regardless of whether a foreach has already iterated over it. When records in the RDD have been changed "behind Spark's back", results will differ depending on whether the partition was evicted (e.g. under memory pressure) or a node crashed and the partition was recomputed.
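A small simulation can show the failure mode without a cluster. This is not Spark or Splash code; `Partition` here is a toy stand-in for a cached block that Spark may drop and deterministically recompute from lineage. Any in-place mutation applied to the cached copy is silently lost on recompute:

```scala
// Toy model of a cached partition that can be evicted and then
// recomputed from an immutable source (its "lineage").
class Partition(source: Seq[Int]) {
  private var cached: Option[Array[Int]] = None

  // Materialize the partition, recomputing from the source if the
  // cached copy has been dropped.
  def data: Array[Int] = cached match {
    case Some(arr) => arr
    case None =>
      val arr = source.toArray // deterministic recompute
      cached = Some(arr)
      arr
  }

  // Simulate an eviction under memory pressure or an executor loss.
  def evict(): Unit = { cached = None }
}

object MutationDemo {
  def main(args: Array[String]): Unit = {
    val p = new Partition(Seq(1, 2, 3))

    // Mutate "behind Spark's back", as a foreach-side update would.
    p.data(0) = 42
    println(p.data.toSeq) // mutation visible while the block is cached

    // After eviction the block is rebuilt from the source, so the
    // in-place update disappears and results diverge.
    p.evict()
    println(p.data.toSeq)
  }
}
```

The two prints differ only because an eviction happened in between, which is exactly the nondeterminism described above: the program's output now depends on GC pressure and node failures rather than on the computation itself.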

Now, perhaps there's a good reason why this is not an issue for the approach Splash takes. I would certainly be curious to know under which conditions it is possible to perform in-memory mutations without telling Spark and still retain the same fault tolerance guarantees.