yanboliang / spark-vlbfgs

Vector-free L-BFGS implementation for Spark MLlib
Apache License 2.0

Avoid using ArrayBuffer to create COO format matrix. #38

Closed (uncleGen closed this 7 years ago)

uncleGen commented 7 years ago

Before PR

Because we do not know how many active features each block contains, we use a mutable Scala collection, i.e. ArrayBuffer, to hold the row, column, and value data.
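For illustration, the buffer-based approach described above might look roughly like the sketch below. The names `CooBufferSketch` and `buildCoo`, and the input shape (a sequence of `(row, col, value)` entries in which zeros count as inactive), are assumptions made for this example and are not taken from the PR.

```scala
import scala.collection.mutable.ArrayBuffer

object CooBufferSketch {
  // Hypothetical helper (not from the PR): collects the non-zero entries into
  // growable buffers, then copies each buffer into a plain Array at the end.
  def buildCoo(entries: Seq[(Int, Int, Double)]): (Array[Int], Array[Int], Array[Double]) = {
    val rows = new ArrayBuffer[Int]
    val cols = new ArrayBuffer[Int]
    val values = new ArrayBuffer[Double]
    entries.foreach { case (r, c, v) =>
      if (v != 0.0) {
        rows += r
        cols += c
        values += v
      }
    }
    // The extra ArrayBuffer -> Array conversion mentioned in the description.
    (rows.toArray, cols.toArray, values.toArray)
  }
}
```

The buffers grow as entries arrive and are copied once more by `toArray`, so each block's data is allocated more than once before the final arrays exist.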

After PR

We add a preliminary pass that counts the active features in each block, so that we can allocate arrays of a known, fixed size. This change avoids the extra conversion from ArrayBuffer to Array, which in turn reduces heap memory pressure and GC overhead.
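A minimal sketch of the count-then-fill idea follows. It assumes the same hypothetical input shape as the description implies (a sequence of `(row, col, value)` entries, zeros treated as inactive); `CooCountSketch` and `buildCoo` are illustrative names, not the identifiers used in the PR.

```scala
object CooCountSketch {
  // Hypothetical helper (not from the PR): two passes over the entries.
  def buildCoo(entries: Seq[(Int, Int, Double)]): (Array[Int], Array[Int], Array[Double]) = {
    // Pass 1: count the active (non-zero) entries so the exact size is known.
    val nnz = entries.count { case (_, _, v) => v != 0.0 }

    // Allocate the row, column, and value arrays once, at their final size.
    val rows = new Array[Int](nnz)
    val cols = new Array[Int](nnz)
    val values = new Array[Double](nnz)

    // Pass 2: write each active entry directly into its final position.
    var i = 0
    entries.foreach { case (r, c, v) =>
      if (v != 0.0) {
        rows(i) = r
        cols(i) = c
        values(i) = v
        i += 1
      }
    }
    (rows, cols, values)
  }
}
```

The trade-off is one additional pass over the data in exchange for primitive arrays that are allocated once, with no growable buffer and no final copy.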

Benchmark

Here are some test results from a local environment:

| | Before PR | After PR |
|---|---|---|
| Task duration (with GC issue) | 2.3~2.8 min | 16~20 s |
| GC duration | 1.8~2.6 min | <5 s |

Followups

Add a new unit test for this change?

Unit Test

All tests passed.

uncleGen commented 7 years ago

cc @yanboliang

uncleGen commented 7 years ago

@yanboliang any feedback?