Implement vector-free logistic regression (V-LOR) based on VectorFreeLBFGS. This algorithm is designed for training on data with billions of feature dimensions.
Design
This algorithm must handle extremely large input data without causing memory overflow on any node, and every stage of the algorithm must be parallelized.
Internal storage format for features data
Each point's feature vector contains billions of dimensions but is very sparse, so the proper design is to split the whole feature matrix into a distributed block matrix in which each block is a sparse matrix (block sparse-matrix compression may be supported later). The input data is shuffled to generate this distributed block matrix, and the corresponding label and weight vectors (also distributed) are generated along the way.
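The splitting step can be sketched as follows. This is a minimal illustration in plain Scala (no Spark); the names `SparseVec` and `toBlocks` are hypothetical, and in the real implementation each block would live in a partition of a distributed RDD rather than a local map.

```scala
// Hypothetical sketch: split sparse feature vectors into a grid of sparse
// blocks keyed by (rowBlock, colBlock). Each block holds COO-style
// (localRow, localCol, value) triples.
case class SparseVec(size: Int, indices: Array[Int], values: Array[Double])

def toBlocks(points: Seq[SparseVec], rowsPerBlock: Int, colsPerBlock: Int)
    : Map[(Int, Int), Seq[(Int, Int, Double)]] =
  points.zipWithIndex.flatMap { case (v, row) =>
    v.indices.zip(v.values).map { case (col, value) =>
      // Map each nonzero entry to its block key and block-local coordinates.
      ((row / rowsPerBlock, col / colsPerBlock),
       (row % rowsPerBlock, col % colsPerBlock, value))
    }
  }.groupBy(_._1).map { case (key, entries) => key -> entries.map(_._2) }
```

Because only nonzero entries are stored, a billion-dimension but sparse feature vector costs memory proportional to its nonzeros, not its dimension.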
Input data standardization
Before training, the implementation first standardizes the input data (before generating the distributed block matrix), which makes training converge faster.
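As a local illustration of the standardization step, the sketch below scales each feature column by the inverse of its standard deviation. Note this sketch skips mean-centering (an assumption on my part, chosen because centering would destroy sparsity); the function names are hypothetical.

```scala
// Hypothetical sketch (plain Scala, dense rows for clarity): scale each
// column by 1 / std. Zero-variance columns are left unscaled to avoid
// division by zero. Centering is skipped so sparsity would be preserved.
def columnStd(rows: Seq[Array[Double]]): Array[Double] = {
  val n = rows.length
  val dim = rows.head.length
  val mean = Array.tabulate(dim)(j => rows.map(_(j)).sum / n)
  Array.tabulate(dim) { j =>
    math.sqrt(rows.map(r => math.pow(r(j) - mean(j), 2)).sum / n)
  }
}

def standardize(rows: Seq[Array[Double]]): Seq[Array[Double]] = {
  val std = columnStd(rows)
  rows.map(_.zipWithIndex.map { case (x, j) =>
    if (std(j) > 0) x / std(j) else x
  })
}
```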
DiffFunction implementation
The DiffFunction calculates the gradient and function value of the LOR objective. It needs to be completely parallelized: the input is a distributed vector, and the output is another distributed vector (the gradient) plus a double value (the function value). The calculation uses the distributed block matrix (feature data) we just generated, together with the corresponding weight and label vectors (also distributed).
To implement this calculation in a parallelized way, I use a new RDD transformation, mapJoinPartitions, implemented here:
https://github.com/yanboliang/spark-vlbfgs/pull/1
It uses a two-pass calculation: the first pass computes the multiplier for each labeled point and accumulates the function value along the way; the second pass combines the labeled points, scaled by their multipliers, to generate the distributed gradient vector.
Parameters
This V-LOR implementation adds several new parameters compared with LOR in Spark ML:
featuresPartSize represents the length of each part when the features are split.
instanceStackSize represents how many labeled points are packed into one block matrix.
blockMatrixRowPartNum represents how many partitions the distributed block matrix is split into in the row direction.
blockMatrixColPartNum represents how many partitions the distributed block matrix is split into in the column direction.
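To illustrate how the first two parameters shape the block grid, here is a small hypothetical helper (not part of the actual code): the block matrix has ceil(numFeatures / featuresPartSize) column blocks and ceil(numInstances / instanceStackSize) row blocks, which are then grouped into blockMatrixRowPartNum x blockMatrixColPartNum RDD partitions.

```scala
// Hypothetical helper showing the arithmetic behind the parameters:
// returns (rowBlocks, colBlocks) of the distributed block matrix grid.
def gridShape(numFeatures: Long, numInstances: Long,
              featuresPartSize: Long, instanceStackSize: Long): (Long, Long) = {
  val colBlocks = (numFeatures + featuresPartSize - 1) / featuresPartSize  // ceil division
  val rowBlocks = (numInstances + instanceStackSize - 1) / instanceStackSize
  (rowBlocks, colBlocks)
}
```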
TODO
Currently the fitIntercept parameter is not supported. It needs some research to find the best way to add it to this V-LOR.
What changes were proposed in this pull request?
Implement vector-free logistic regression (V-LOR) based on VectorFreeLBFGS, designed for training on data with billions of feature dimensions, as described above.
How was this patch tested?
A test case was added in VLogisticRegressionSuite.