Fix several bottlenecks in data pre-processing and make other improvements

Improve feature block generation efficiency; add parameter generatingFeatureMatrixBuffer.
Improve feature summary computation: reduce memory cost and avoid OOM.
Improve standardized feature block computation: reduce memory cost and shuffled data size.
Add parameter rowPartitionSplitNumOnGeneratingFeatureMatrix, which increases the number of reducers used by the feature block groupBy (this helps avoid possible OOM in some cases).
Other minor updates and example updates.
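
To picture why a larger split number can relieve memory pressure in the feature block groupBy, here is a minimal plain-Scala sketch with no Spark dependency. It is not the code in this change: the names BlockKey, blockKey, rowsPerBlock, colsPerBlock and splitNum are hypothetical, and the way the split index is derived is an assumption made only for illustration. The idea shown is that adding a split component to the grouping key yields more, smaller groups, so each reducer holds fewer entries at once.

```scala
// Hypothetical illustration only; not the spark-vlbfgs implementation.
object SplitKeySketch {
  // Grouping key for matrix entries. `split` is an assumed extra component
  // that spreads the entries of one (rowBlock, colBlock) pair over several groups.
  final case class BlockKey(rowBlock: Int, colBlock: Int, split: Int)

  def blockKey(row: Long, col: Int, rowsPerBlock: Int, colsPerBlock: Int, splitNum: Int): BlockKey = {
    val rowBlock = (row / rowsPerBlock).toInt
    val colBlock = col / colsPerBlock
    // Assumed rule: derive the split index from the row offset inside its block.
    val split = ((row % rowsPerBlock) % splitNum).toInt
    BlockKey(rowBlock, colBlock, split)
  }

  def main(args: Array[String]): Unit = {
    // A small dense 8 x 4 grid of (row, col) entries.
    val entries = for (r <- 0L until 8L; c <- 0 until 4) yield (r, c)
    for (splitNum <- Seq(1, 2, 4)) {
      val groups = entries.groupBy { case (r, c) => blockKey(r, c, 4, 2, splitNum) }
      val largest = groups.values.map(_.size).max
      println(s"splitNum=$splitNum -> ${groups.size} groups, largest group has $largest entries")
    }
  }
}
```

In this toy setup, raising splitNum multiplies the number of groups and shrinks the largest one by the same factor. In a real job the trade-off is more shuffle tasks against a smaller per-reducer footprint.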

yanboliang / spark-vlbfgs