What changes were proposed in this pull request?
1. Add mapJoinPartitions Operator
This operator performs something like a map-side join between the partitions of two RDDs, but it provides more flexibility. The API is defined as follows:
This operator needs two function parameters:
idxF: (Int) => Array[Int] defines the mapping relation: it determines which partitions of RDD[B] are mapped to a given RDD[A] partition. Its input is an RDD[A] partition id, and its output is the list of RDD[B] partition ids mapped to that RDD[A] partition.
f: (Int, Iterator[A], Array[(Int, Iterator[B])]) => Iterator[V] is the user-defined calculation function. It takes three input parameters and returns one output iterator. The parameters are defined as follows:
f: (pid: Int, iterA: Iterator[A], iterBList: Array[(Int, Iterator[B])])
pid is the partition id of RDD[A].
iterA is the iterator over the contents of partition pid of RDD[A].
iterBList is an Array in which each element is a pair (pid_RDDB, iterB), giving an RDD[B] partition id and the iterator over that partition.
The preferred task locations of the mapJoinPartitions operator are kept consistent with those of RDD[A], so that it benefits from data locality, because RDD[A] is usually much larger than RDD[B].
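To make the roles of idxF and f concrete, here is a minimal local sketch of the semantics described above. It does not use Spark: partitions are modelled as plain Scala collections, and the names MapJoinPartitionsSketch, mapJoinPartitionsLocal and demo are hypothetical, chosen only for illustration. It is not the implementation in this patch.

```scala
object MapJoinPartitionsSketch {
  // Hypothetical local model: an "RDD" is a sequence of partitions,
  // and each partition is a sequence of elements.
  def mapJoinPartitionsLocal[A, B, V](
      partsA: IndexedSeq[Seq[A]],
      partsB: IndexedSeq[Seq[B]])(
      idxF: Int => Array[Int],
      f: (Int, Iterator[A], Array[(Int, Iterator[B])]) => Iterator[V]
  ): IndexedSeq[Seq[V]] = {
    partsA.indices.map { pid =>
      // Ask idxF which RDD[B] partitions belong to this RDD[A] partition,
      // and pair each of those partition ids with its iterator.
      val iterBList = idxF(pid).map(bPid => (bPid, partsB(bPid).iterator))
      // Run the user function once per RDD[A] partition.
      f(pid, partsA(pid).iterator, iterBList).toVector
    }
  }

  // Example: scale every element of A by the sum of the mapped B partitions.
  def demo(): IndexedSeq[Seq[Int]] = {
    val partsA = Vector(Seq(1, 2), Seq(3))
    val partsB = Vector(Seq(10), Seq(20), Seq(30))
    mapJoinPartitionsLocal(partsA, partsB)(
      pid => if (pid == 0) Array(0, 1) else Array(2),
      (pid, iterA, iterBList) => {
        val scale = iterBList.map(_._2.sum).sum
        iterA.map(_ * scale)
      }
    )
  }
}
```

In this model the output has one partition per RDD[A] partition, which matches the point above that the operator follows RDD[A] for task placement.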
2. Add GridPartitionerV2
It is similar to the GridPartitioner in Spark MLlib, but here I remove the private[mllib] restriction on the class and make the two members rowPartitions and colPartitions public. I also add two methods, rowPartId and colPartId.
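As a rough illustration of what the public members and the two new methods make possible, the sketch below models the grid layout without depending on Spark. It assumes partition ids are laid out column by column, as in MLlib's GridPartitioner, and it assumes rowPartId and colPartId take a partition id and return its row or column block index. Those two signatures are assumptions; the class name GridLayoutSketch is hypothetical.

```scala
// Hypothetical stand-in for GridPartitionerV2; it does not extend Spark's Partitioner.
class GridLayoutSketch(
    val rows: Int,
    val cols: Int,
    val rowsPerPart: Int,
    val colsPerPart: Int) {
  require(rows > 0 && cols > 0 && rowsPerPart > 0 && colsPerPart > 0)

  // Public here, so callers can reason about the grid shape.
  val rowPartitions: Int = math.ceil(rows * 1.0 / rowsPerPart).toInt
  val colPartitions: Int = math.ceil(cols * 1.0 / colsPerPart).toInt
  val numPartitions: Int = rowPartitions * colPartitions

  // Column-major partition id for the block containing cell (i, j).
  def getPartitionId(i: Int, j: Int): Int =
    i / rowsPerPart + j / colsPerPart * rowPartitions

  // Assumed semantics: recover the row block index from a partition id.
  def rowPartId(pid: Int): Int = pid % rowPartitions

  // Assumed semantics: recover the column block index from a partition id.
  def colPartId(pid: Int): Int = pid / rowPartitions
}
```

With helpers like these, an idxF for mapJoinPartitions could, for example, select every partition in the same row block or column block as a given partition id.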
How was this patch tested?
Test class VFRDDFunctionsSuite added.