yanboliang / spark-vlbfgs

Vector-free L-BFGS implementation for Spark MLlib
Apache License 2.0

optimize MultivariateOnlineSummarizer to reduce memory cost and shuffle data size #7

Closed WeichenXu123 closed 7 years ago

WeichenXu123 commented 7 years ago

What changes were proposed in this pull request?

Add OptimMultivariateOnlineSummarizer, which lets the user specify which statistics to compute. For example, VLogisticRegression currently needs only the variance, so with OptimMultivariateOnlineSummarizer it does not allocate the arrays for other statistics such as min and max.

For sparse input data, this optimization noticeably reduces both the memory cost and the shuffle data size.

API design

Add a mask parameter to the OptimMultivariateOnlineSummarizer constructor. It currently accepts the following values:

meanMask
varianceMask
minMask
maxMask
numNonZerosMask

Specify the mask parameter to tell OptimMultivariateOnlineSummarizer which statistics are actually needed.
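To make the mask idea concrete, here is a minimal sketch in plain Scala with no Spark dependency. It is not the code from this patch: the class name `MaskedSummarizer`, the bit values assigned to the masks, combining masks with `|`, and the dense-only `add` method are all assumptions made for illustration. Only the five mask names come from the list above.

```scala
// Hypothetical sketch. Mask names follow the PR text; the bit values are assumed.
object SummarizerMasks {
  val meanMask: Int = 1 << 0
  val varianceMask: Int = 1 << 1
  val minMask: Int = 1 << 2
  val maxMask: Int = 1 << 3
  val numNonZerosMask: Int = 1 << 4
}

// Simplified dense-only summarizer that allocates a buffer only when the
// corresponding statistic was requested through the mask.
class MaskedSummarizer(val numFeatures: Int, val mask: Int) extends Serializable {
  import SummarizerMasks._

  private def wants(m: Int): Boolean = (mask & m) != 0

  private def alloc(needed: Boolean, init: Double): Option[Array[Double]] =
    if (needed) Some(Array.fill(numFeatures)(init)) else None

  private var count = 0L
  // The variance update needs the running mean, so both share the mean buffer.
  private val meanBuf = alloc(wants(meanMask) || wants(varianceMask), 0.0)
  private val m2Buf = alloc(wants(varianceMask), 0.0)
  private val minBuf = alloc(wants(minMask), Double.PositiveInfinity)
  private val maxBuf = alloc(wants(maxMask), Double.NegativeInfinity)
  private val nnzBuf = alloc(wants(numNonZerosMask), 0.0)

  // Number of per-feature arrays actually held in memory.
  def allocatedArrays: Int =
    Seq(meanBuf, m2Buf, minBuf, maxBuf, nnzBuf).count(_.isDefined)

  // Folds one dense sample into the requested statistics (Welford update).
  def add(sample: Array[Double]): this.type = {
    require(sample.length == numFeatures, "dimension mismatch")
    count += 1
    var i = 0
    while (i < numFeatures) {
      val x = sample(i)
      meanBuf.foreach { mu =>
        val delta = x - mu(i)
        mu(i) += delta / count
        m2Buf.foreach(m2 => m2(i) += delta * (x - mu(i)))
      }
      minBuf.foreach(a => if (x < a(i)) a(i) = x)
      maxBuf.foreach(a => if (x > a(i)) a(i) = x)
      nnzBuf.foreach(a => if (x != 0.0) a(i) += 1.0)
      i += 1
    }
    this
  }

  // Returns a copy of the buffer, or fails if the statistic was not requested.
  private def get(requested: Boolean, buf: Option[Array[Double]], name: String): Array[Double] = {
    if (!requested) throw new IllegalStateException(s"$name was not requested in the mask")
    buf.get.clone()
  }

  def mean: Array[Double] = get(wants(meanMask), meanBuf, "mean")
  def min: Array[Double] = get(wants(minMask), minBuf, "min")
  def max: Array[Double] = get(wants(maxMask), maxBuf, "max")
  def numNonZeros: Array[Double] = get(wants(numNonZerosMask), nnzBuf, "numNonZeros")

  // Unbiased sample variance; all zeros when fewer than two samples were seen.
  def variance: Array[Double] = {
    val m2 = get(wants(varianceMask), m2Buf, "variance")
    if (count > 1) m2.map(_ / (count - 1)) else Array.fill(numFeatures)(0.0)
  }
}
```

With `new MaskedSummarizer(n, SummarizerMasks.varianceMask)`, the sketch holds two arrays of length `n` instead of five, which is where the memory and shuffle savings would come from in this simplified model.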

This is similar to my spark PR created here: https://github.com/apache/spark/pull/14950

How was this patch tested?

OptimMultivariateOnlineSummarizer added.