yanboliang / spark-vlbfgs

Vector-free L-BFGS implementation for Spark MLlib
Apache License 2.0
46 stars 17 forks source link

[spark-vflbfgs] add mapJoinPartitions Operator (prepared for Vector-free LogisticRegression implementation) #1

Closed WeichenXu123 closed 7 years ago

WeichenXu123 commented 7 years ago

What changes were proposed in this pull request?

1. Add mapJoinPartitions Operator

this operator do something like map-side join between two RDD's partitions, but it provides more flexibility. The API defined as following:

def mapJoinPartition[B: ClassTag, V: ClassTag](rdd2: RDD[B])(
    idxF: (Int) => Array[Int],
    f: (Int, Iterator[A], Array[(Int, Iterator[B])]) => Iterator[V]
  )

this operator need two function parameter: idxF: (Int) => Array[Int] define the mapping relation, determine which partitions of RDD[B] will be mapped to the relative RDD[A] partition. The input parameter is RDD[A] partition id, and the output parameter is the partitions id-list of RDD[B], which will be mapped to the specified RDD[A] partition.

f: (Int, Iterator[A], Array[(Int, Iterator[B])]) => Iterator[V] is the user-defined calculation function. it contains 3 input parameter and 1 output iterator. parameter defined as following: f: (pid: Int, iterA: Iterator[A], iterBList: Array[(Int, Iterator[B])]) pid is the partition id of RDD[A] iterA is the content iterator of RDD[A]'s "pid" partition. iterBList is an Array and each element is a pair of (pid_RDDB, iterB) represent the pid of RDD[B] and the corresponding RDD[B] partition iterator.

The task preferred locations of mapJoinPartition operator will keep consistent with RDD[A], so that it will benefit from data locality, because usually RDD[A] will be much larger than RDD[B].

2. Add GridPartitionerV2

It is similar to the GridPartitioner in spark mllib, but here I remove the private[mllib] restriction on the class and change two members rowPartitions, colPartitions into public. And add two methods rowPartId, colPartId.

How was this patch tested?

Test class VFRDDFunctionsSuite added.