Fix parallelism limit problem in featureBlockMatrix initialization.
What changes were proposed in this pull request?
Fix the parallelism limit in `featureBlockMatrix` initialization. When the feature data is shuffled and aggregated into `featureBlockMatrix`, the partitioner is now `GridPartitionerV2`, which gives good parallelism and avoids a second shuffle during feature standardization. The feature standardization logic that follows is updated accordingly: it now uses `blockMatrixHorzZipVec` instead of `zipPartition`.

Modify the return type of the `blockMatrixHorzZipVec` and `blockMatrixVertZipVec` APIs to `RDD[((Int, Int), T)]`.
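
For readers unfamiliar with the idea, here is a minimal sketch of grid partitioning and a horizontal zip keyed by block coordinates. It is not the code in this patch: plain Scala collections stand in for RDDs, and `SimpleGridPartitioner`, `horzZipVec`, the row-major partition layout, and the pairing of block `(i, j)` with vector segment `j` are all assumptions made for illustration.

```scala
// Hypothetical sketch: plain Scala Maps stand in for Spark RDDs.
// None of these names are taken from the patch itself.
object GridSketch {

  // Assumed scheme: one partition per (rowBlock, colBlock) cell, laid out
  // row-major, so the partition count is rowBlocks * colBlocks rather than
  // being capped by the number of blocks along a single dimension.
  final case class SimpleGridPartitioner(rowBlocks: Int, colBlocks: Int) {
    val numPartitions: Int = rowBlocks * colBlocks

    def getPartition(key: (Int, Int)): Int = {
      val (i, j) = key
      require(i >= 0 && i < rowBlocks && j >= 0 && j < colBlocks,
        s"block $key is outside the grid")
      i * colBlocks + j
    }
  }

  // Assumed semantics: pair block (i, j) with vector segment j, apply f,
  // and keep the block coordinates as the key of the result, mirroring a
  // result type shaped like ((Int, Int), T).
  def horzZipVec[B, V, T](blocks: Map[(Int, Int), B], vec: Map[Int, V])(
      f: (B, V) => T): Map[(Int, Int), T] =
    blocks.map { case ((i, j), block) => ((i, j), f(block, vec(j))) }
}
```

Because the result stays keyed by `(rowBlock, colBlock)`, a standardization step written this way can reuse the same grid layout instead of repartitioning. For example, dividing each column of every block by a per-column scale:

```scala
import GridSketch._

val blocks = Map((0, 0) -> Array(Array(2.0, 4.0)), (1, 0) -> Array(Array(6.0, 8.0)))
val scale  = Map(0 -> Array(2.0, 4.0)) // hypothetical per-column scale for column block 0

val standardized = horzZipVec(blocks, scale) { (block, s) =>
  block.map(row => row.zip(s).map { case (x, d) => x / d })
}
```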
How was this patch tested?
Existing tests. The `VFUtilsSuite` test was updated.