What changes were proposed in this pull request?
Add OptimMultivariateOnlineSummarizer, which lets the user specify which statistics to summarize. For example, VLogisticRegression currently only needs the variance, so with OptimMultivariateOnlineSummarizer it won't allocate the other arrays for statistics such as min/max.
For sparse input data, this optimization noticeably reduces memory cost and shuffle data size.
API design
Add a mask parameter to the OptimMultivariateOnlineSummarizer constructor. It can currently take the following values:
Specify the mask parameter to tell OptimMultivariateOnlineSummarizer which statistics are actually needed.
This is similar to my Spark PR: https://github.com/apache/spark/pull/14950
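The idea can be illustrated with a minimal sketch, assuming the mask is a set of bit flags. The names MaskedSummarizer and SummarizerMask, the flag values, and the dense Array[Double] input are all hypothetical and chosen for illustration only; they are not the API added by this PR.

```scala
// Hypothetical flag values, for illustration only (not the PR's actual mask values).
object SummarizerMask {
  val Mean: Int = 1 << 0
  val Variance: Int = 1 << 1
  val Min: Int = 1 << 2
  val Max: Int = 1 << 3
}

// Hypothetical summarizer that allocates a buffer only for each requested statistic.
class MaskedSummarizer(numFeatures: Int, mask: Int) {
  import SummarizerMask._

  private def wants(flag: Int): Boolean = (mask & flag) != 0

  // Variance is computed from the running mean, so either request allocates the mean buffer.
  private val meanBuf: Option[Array[Double]] =
    if (wants(Mean) || wants(Variance)) Some(new Array[Double](numFeatures)) else None
  // Sum of squared deviations from the running mean (Welford's method).
  private val m2Buf: Option[Array[Double]] =
    if (wants(Variance)) Some(new Array[Double](numFeatures)) else None
  private val minBuf: Option[Array[Double]] =
    if (wants(Min)) Some(Array.fill(numFeatures)(Double.PositiveInfinity)) else None
  private val maxBuf: Option[Array[Double]] =
    if (wants(Max)) Some(Array.fill(numFeatures)(Double.NegativeInfinity)) else None

  private var count: Long = 0L

  // Fold one dense sample into the requested statistics; unrequested ones cost nothing.
  def add(sample: Array[Double]): this.type = {
    require(sample.length == numFeatures, "dimension mismatch")
    count += 1
    var i = 0
    while (i < numFeatures) {
      val x = sample(i)
      meanBuf.foreach { mu =>
        val delta = x - mu(i)
        mu(i) += delta / count
        m2Buf.foreach(s => s(i) += delta * (x - mu(i)))
      }
      minBuf.foreach(a => if (x < a(i)) a(i) = x)
      maxBuf.foreach(a => if (x > a(i)) a(i) = x)
      i += 1
    }
    this
  }

  // Each accessor returns None when the statistic was not requested in the mask.
  def mean: Option[Array[Double]] =
    if (wants(Mean)) meanBuf.map(_.clone()) else None
  def variance: Option[Array[Double]] =
    m2Buf.map(_.map(s => if (count > 1) s / (count - 1) else 0.0))
  def min: Option[Array[Double]] = minBuf.map(_.clone())
  def max: Option[Array[Double]] = maxBuf.map(_.clone())

  // Total number of doubles held in buffers, as a rough proxy for memory and shuffle size.
  def allocatedDoubles: Int =
    Seq(meanBuf, m2Buf, minBuf, maxBuf).map(o => o.map(_.length).getOrElse(0)).sum
}
```

In this sketch, a caller that only needs the variance would construct `new MaskedSummarizer(n, SummarizerMask.Variance)`, which holds two arrays of length n instead of four. Flags can be combined with bitwise OR, for example `SummarizerMask.Min | SummarizerMask.Max`.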
How was this patch tested?
OptimMultivariateOnlineSummarizer added.