Avoid using ArrayBuffer to create COO-format matrices.

uncleGen commented 7 years ago

Before PR

Because we do not know in advance how many active features each block contains, we use a mutable Scala collection, i.e. ArrayBuffer, to hold the row, column, and value data.

After PR

We add a preliminary pass that counts the active features in each block, so the arrays can be created with a known, fixed size. This avoids the extra conversion from ArrayBuffer to Array and reduces heap memory pressure, such as GC overhead.
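The count-then-fill idea described above can be sketched roughly as follows. This is only an illustration, not the code in this PR: `CooTwoPass`, `Entry`, and `buildCoo` are hypothetical names, and the sketch assumes the entries of a block can be iterated twice.

```scala
// Hypothetical sketch; none of these names come from the PR itself.
object CooTwoPass {
  // One active (non-zero) feature: row index, column index, value.
  final case class Entry(row: Int, col: Int, value: Double)

  // Builds the three COO arrays (rows, cols, values) for one block.
  // `entries` is invoked twice, so it must return a fresh iterator each time.
  def buildCoo(entries: () => Iterator[Entry]): (Array[Int], Array[Int], Array[Double]) = {
    // Pass 1: count the active features so the arrays can be sized exactly.
    var nnz = 0
    val counting = entries()
    while (counting.hasNext) {
      counting.next()
      nnz += 1
    }

    // Allocate fixed-size arrays once: no resizing and no final
    // ArrayBuffer-to-Array copy.
    val rows = new Array[Int](nnz)
    val cols = new Array[Int](nnz)
    val values = new Array[Double](nnz)

    // Pass 2: fill the arrays in place.
    var i = 0
    val filling = entries()
    while (filling.hasNext) {
      val e = filling.next()
      rows(i) = e.row
      cols(i) = e.col
      values(i) = e.value
      i += 1
    }
    (rows, cols, values)
  }
}
```

The trade-off is one extra traversal of the block in exchange for a single exact allocation per array, which is where the reduced GC pressure would be expected to come from.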

Benchmark

Here are some test results from our test environment:

data set: 200 million (rows) x 1 billion (features)
cluster setup: 20 x 32 (cores) x 128 GB @ Alibaba Cloud E-MapReduce

| Metric | Before PR | After PR |
| --- | --- | --- |
| Task duration (with GC issue) | 2.3~2.8 min | 16~20 s |
| GC duration | 1.8~2.6 min | <5 s |

Followups

Add a new unit test for this change?

Unit Test

All tests passed.

uncleGen commented 7 years ago

cc @yanboliang

uncleGen commented 7 years ago

@yanboliang any feedback?

yanboliang / spark-vlbfgs