Implement vector-free logistic regression (V-LOR) based on VectorFreeLBFGS. This algorithm is designed for training on data with billions of feature dimensions.
Design
This algorithm must handle extremely large input data without causing memory overflow on any node, and every stage of the algorithm must be parallelized.
Internal storage format for features data
Each point's feature vector contains billions of dimensions but is very sparse, so the proper design is to split the whole feature matrix into a distributed block matrix in which each block is a sparse matrix (block sparse-matrix compression may be supported later). The input data is shuffled to generate this distributed block matrix, and the corresponding label and weight vectors (also distributed) are generated along the way.
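The splitting step can be sketched as follows. This is a minimal illustration in plain Scala (no Spark); the names `SparseVec` and `toBlocks` are hypothetical, and in the real implementation each block would live in a partition of a distributed RDD rather than a local map.

```scala
// Hypothetical sketch: split sparse feature vectors into a grid of sparse
// blocks keyed by (rowBlock, colBlock). Each block holds COO-style
// (localRow, localCol, value) triples.
case class SparseVec(size: Int, indices: Array[Int], values: Array[Double])

def toBlocks(points: Seq[SparseVec], rowsPerBlock: Int, colsPerBlock: Int)
    : Map[(Int, Int), Seq[(Int, Int, Double)]] =
  points.zipWithIndex.flatMap { case (v, row) =>
    v.indices.zip(v.values).map { case (col, value) =>
      // Map each nonzero entry to its block key and block-local coordinates.
      ((row / rowsPerBlock, col / colsPerBlock),
       (row % rowsPerBlock, col % colsPerBlock, value))
    }
  }.groupBy(_._1).map { case (key, entries) => key -> entries.map(_._2) }
```

Because only nonzero entries are stored, a billion-dimension but sparse feature vector costs memory proportional to its nonzeros, not its dimension.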
Input data standardization
Before training, the implementation first standardizes the input data (before generating the distributed block matrix), which makes training converge faster.
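As a local illustration of the standardization step, the sketch below scales each feature column by the inverse of its standard deviation. Note this sketch skips mean-centering (an assumption on my part, chosen because centering would destroy sparsity); the function names are hypothetical.

```scala
// Hypothetical sketch (plain Scala, dense rows for clarity): scale each
// column by 1 / std. Zero-variance columns are left unscaled to avoid
// division by zero. Centering is skipped so sparsity would be preserved.
def columnStd(rows: Seq[Array[Double]]): Array[Double] = {
  val n = rows.length
  val dim = rows.head.length
  val mean = Array.tabulate(dim)(j => rows.map(_(j)).sum / n)
  Array.tabulate(dim) { j =>
    math.sqrt(rows.map(r => math.pow(r(j) - mean(j), 2)).sum / n)
  }
}

def standardize(rows: Seq[Array[Double]]): Seq[Array[Double]] = {
  val std = columnStd(rows)
  rows.map(_.zipWithIndex.map { case (x, j) =>
    if (std(j) > 0) x / std(j) else x
  })
}
```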
DiffFunction implementation
The DiffFunction calculates the gradient and function value of the LOR objective. It needs to be completely parallelized: the input is a distributed vector, and the output is another distributed vector (the gradient) plus a double value (the function value). The calculation uses the distributed block matrix (feature data) we just generated, together with the corresponding weight and label vectors (also distributed).
To implement this calculation in a parallelized way, I use a new RDD transformation, mapJoinPartitions, implemented here:
https://github.com/yanboliang/spark-vlbfgs/pull/1
It uses a two-pass calculation: the first pass computes the multiplier for each labeled point and accumulates the function value along the way; the second pass combines the labeled points, scaled by their multipliers, to generate the distributed gradient vector.
Parameters
This V-LOR implementation adds several new parameters compared with LOR in Spark ML:
featuresPartSize represents the length of each part when the features are split.
instanceStackSize represents how many labeled points are packed into one block matrix.
blockMatrixRowPartNum represents how many partitions the distributed block matrix is split into in the row direction.
blockMatrixColPartNum represents how many partitions the distributed block matrix is split into in the column direction.
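To illustrate how the first two parameters shape the block grid, here is a small hypothetical helper (not part of the actual code): the block matrix has ceil(numFeatures / featuresPartSize) column blocks and ceil(numInstances / instanceStackSize) row blocks, which are then grouped into blockMatrixRowPartNum x blockMatrixColPartNum RDD partitions.

```scala
// Hypothetical helper showing the arithmetic behind the parameters:
// returns (rowBlocks, colBlocks) of the distributed block matrix grid.
def gridShape(numFeatures: Long, numInstances: Long,
              featuresPartSize: Long, instanceStackSize: Long): (Long, Long) = {
  val colBlocks = (numFeatures + featuresPartSize - 1) / featuresPartSize  // ceil division
  val rowBlocks = (numInstances + instanceStackSize - 1) / instanceStackSize
  (rowBlocks, colBlocks)
}
```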
TODO
Currently the fitIntercept parameter is not supported. It needs some research to find the best way to add it to this V-LOR.
What changes were proposed in this pull request?
Implement vector-free logistic regression (V-LOR) based on VectorFreeLBFGS, designed for training on data with billions of feature dimensions, as described above.
How was this patch tested?
A test case was added in VLogisticRegressionSuite.