Fix double-shuffling and parallelism limit problem in featureBlockMatrix initialization

Fix the parallelism limit problem in featureBlockMatrix initialization. When feature data is shuffled and aggregated into featureBlockMatrix, the partitioner is now GridPartitionerV2, which gives good parallelism and avoids a second shuffle during feature standardization. The feature standardization logic that follows is updated accordingly: it now uses blockMatrixHorzZipVec instead of zipPartition.
Change the return type of the blockMatrixHorzZipVec and blockMatrixVertZipVec APIs to RDD[((Int, Int), T)].
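
For reviewers unfamiliar with grid partitioning, here is a minimal sketch of how a block coordinate `(rowBlockIdx, colBlockIdx)`, the same key shape as in `RDD[((Int, Int), T)]`, can be mapped to a partition id. `SimpleGridPartitioner` and its parameters are hypothetical names made up for illustration; this is not the `GridPartitionerV2` code in this PR and it does not depend on Spark.

```scala
// Hypothetical illustration only: NOT the GridPartitionerV2 implementation.
// Maps a block coordinate (rowBlockIdx, colBlockIdx) to a partition id by
// tiling the grid of blocks into rectangular groups of blocks.
final case class SimpleGridPartitioner(
    rowBlocks: Int,        // number of block rows in the matrix
    colBlocks: Int,        // number of block columns in the matrix
    rowBlocksPerPart: Int, // block rows grouped into one partition
    colBlocksPerPart: Int  // block columns grouped into one partition
) {
  require(rowBlocks > 0 && colBlocks > 0, "grid dimensions must be positive")
  require(rowBlocksPerPart > 0 && colBlocksPerPart > 0,
    "blocks per partition must be positive")

  // Ceiling division: number of partition tiles along each axis.
  private val rowParts = (rowBlocks + rowBlocksPerPart - 1) / rowBlocksPerPart
  private val colParts = (colBlocks + colBlocksPerPart - 1) / colBlocksPerPart

  val numPartitions: Int = rowParts * colParts

  // Partition ids are laid out column-major over the tiles.
  def getPartition(key: (Int, Int)): Int = {
    val (i, j) = key
    require(i >= 0 && i < rowBlocks && j >= 0 && j < colBlocks,
      s"block $key is out of range")
    i / rowBlocksPerPart + (j / colBlocksPerPart) * rowParts
  }
}
```

With a 4 x 6 grid of blocks and 2 x 3 blocks per partition, this sketch yields 4 partitions. Because every block has a fixed partition determined only by its coordinate, a later step that is keyed by the same coordinates and uses the same partitioner can run without moving blocks again.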

Covered by existing tests. The VFUtilsSuite test is updated.

yanboliang / spark-vlbfgs