yanboliang / spark-vlbfgs

Vector-free L-BFGS implementation for Spark MLlib
Apache License 2.0

Implement vlogistic regression based on VectorFreeLBFGS #3

Closed. WeichenXu123 closed this pull request 7 years ago.

WeichenXu123 commented 7 years ago

What changes were proposed in this pull request?

Implement V-LOR (vector-free logistic regression) based on VectorFreeLBFGS. This algorithm is designed for training data with billions of feature dimensions.

Design

This algorithm must handle extremely large input data: it must not overflow any single node's memory, and every stage of the algorithm must be parallelized.

Internal storage format for features data

Each point's feature vector has billions of dimensions but is very sparse, so the proper design is to split the whole features matrix into a distributed block matrix in which each block is a sparse matrix. (We may support block sparse matrix compression later.) Shuffling the input data generates this distributed block matrix, along with the corresponding label and weight vectors (also distributed).
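To illustrate the blocking scheme, here is a minimal local sketch of partitioning sparse matrix entries into fixed-size sparse blocks. In the actual implementation the blocks live in an RDD keyed by block coordinates; the `Map` below, and the `BlockMatrixSketch`/`toBlocks` names, are illustrative stand-ins, not the PR's API.

```scala
object BlockMatrixSketch {
  // ((row, col), value) for a nonzero entry of the sparse matrix.
  type Entry = ((Int, Int), Double)

  // Group sparse entries by (rowBlock, colBlock); each block keeps only
  // its own nonzeros, re-indexed with block-local coordinates.
  def toBlocks(entries: Seq[Entry], blockSize: Int): Map[(Int, Int), Seq[Entry]] =
    entries
      .groupBy { case ((r, c), _) => (r / blockSize, c / blockSize) }
      .map { case (blockId, es) =>
        blockId -> es.map { case ((r, c), v) =>
          ((r % blockSize, c % blockSize), v)
        }
      }
}
```

In a distributed setting the same grouping key `(r / blockSize, c / blockSize)` would drive the shuffle that builds the block matrix.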

Input data standardization

Before training, the implementation first standardizes the input data (before generating the distributed block matrix); this makes training converge faster.
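A common choice (the one Spark ML's own logistic regression makes) is to scale each feature column by its standard deviation without subtracting the mean, so that sparsity is preserved; whether V-LOR does exactly this is an assumption here. A local sketch with plain arrays standing in for the distributed data:

```scala
object StandardizeSketch {
  // Sample standard deviation of each feature column.
  def columnStd(rows: Seq[Array[Double]]): Array[Double] = {
    val n = rows.length.toDouble
    val dims = rows.head.length
    val mean = Array.tabulate(dims)(j => rows.map(_(j)).sum / n)
    Array.tabulate(dims) { j =>
      math.sqrt(rows.map(r => math.pow(r(j) - mean(j), 2)).sum / (n - 1))
    }
  }

  // Scale by std only (no centering), so zero entries stay zero.
  def scale(rows: Seq[Array[Double]], std: Array[Double]): Seq[Array[Double]] =
    rows.map(_.zip(std).map { case (v, s) => if (s == 0.0) 0.0 else v / s })
}
```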

DiffFunction implementation

The DiffFunction computes the gradient and objective value of the LOR objective. It needs to be completely parallelized: the input is a distributed vector, and the output is another distributed vector (the gradient) plus a double (the objective value). The computation uses the distributed block matrix of features we just generated, together with the corresponding weight and label vectors (also distributed). To implement this in a parallelized way, I use a new RDD transformation, mapJoinPartitions, implemented here: https://github.com/yanboliang/spark-vlbfgs/pull/1. The computation takes two passes: the first pass computes a multiplier for each labeled point and accumulates the objective value along the way; the second pass combines the labeled points using their multipliers to produce the distributed gradient vector.
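The two-pass structure can be sketched locally. For standard weighted logistic loss with labels in {0, 1}, the per-point multiplier is weight × (sigmoid(margin) − label), and the gradient is the multiplier-weighted sum of the feature vectors. The `Seq` below stands in for the RDD, and all names (`TwoPassGradient`, `Point`, `lossAndGradient`) are illustrative, not the PR's API:

```scala
object TwoPassGradient {
  case class Point(features: Array[Double], label: Double, weight: Double)

  def lossAndGradient(points: Seq[Point], coef: Array[Double]): (Double, Array[Double]) = {
    // Pass 1: compute the multiplier for each labeled point and
    // accumulate the objective (log-loss) value along the way.
    val withMult = points.map { p =>
      val margin = p.features.zip(coef).map { case (x, w) => x * w }.sum
      val prob = 1.0 / (1.0 + math.exp(-margin))
      val loss =
        -p.weight * (p.label * math.log(prob) + (1.0 - p.label) * math.log(1.0 - prob))
      (p, p.weight * (prob - p.label), loss)
    }
    val totalLoss = withMult.map(_._3).sum
    // Pass 2: combine the labeled points using their multipliers
    // to form the gradient vector.
    val grad = new Array[Double](coef.length)
    withMult.foreach { case (p, m, _) =>
      for (j <- p.features.indices) grad(j) += m * p.features(j)
    }
    (totalLoss, grad)
  }
}
```

In the distributed version, pass 1 runs over the feature blocks joined with the coefficient vector's partitions (via mapJoinPartitions), and pass 2 reduces the per-block partial gradients into the distributed gradient vector.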

Parameters

This V-LOR implementation adds several new parameters compared with LOR in Spark ML:

Currently the fitIntercept parameter is not supported. It needs some research to find the best way to add it to this V-LOR.

How was this patch tested?

A VLogisticRegressionSuite test case was added.