Before PR
Since we do not know in advance how many active features each block contains, we use a mutable Scala collection, i.e. ArrayBuffer, to hold the row, column, and value data.
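A minimal sketch of the before-PR approach, assuming a hypothetical `blockifyWithBuffers` helper that accumulates COO-style (row, column, value) triples for one block; the function name and input shape are illustrative, not the actual Spark code:

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical sketch: accumulate the sparse entries of a block when the
// number of active features is not known up front.
// Each input row is an array of (featureIndex, value) pairs.
def blockifyWithBuffers(
    rows: Seq[Array[(Int, Double)]]
): (Array[Int], Array[Int], Array[Double]) = {
  val rowIdx = ArrayBuffer.empty[Int]
  val colIdx = ArrayBuffer.empty[Int]
  val values = ArrayBuffer.empty[Double]
  rows.zipWithIndex.foreach { case (row, i) =>
    row.foreach { case (j, v) =>
      rowIdx += i
      colIdx += j
      values += v
    }
  }
  // The final toArray calls are the extra ArrayBuffer -> Array copies
  // (plus the buffers' own resize-and-copy growth) that cost heap and GC.
  (rowIdx.toArray, colIdx.toArray, values.toArray)
}
```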
After PR
We add a pre-pass that counts the active features in each block, so that arrays of exactly the right size can be allocated up front. This avoids the extra ArrayBuffer-to-Array transformation and reduces heap memory pressure (e.g. GC overhead).
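A sketch of the after-PR approach under the same assumptions (the `blockifyPreSized` name and input shape are illustrative): count the non-zeros first, then fill fixed-size arrays in place, so no buffer growth or final copy is needed:

```scala
// Hypothetical sketch: pre-count the active features, then allocate
// exact-size arrays once and fill them directly.
// Each input row is an array of (featureIndex, value) pairs.
def blockifyPreSized(
    rows: Seq[Array[(Int, Double)]]
): (Array[Int], Array[Int], Array[Double]) = {
  // Pre-pass: total number of active features in this block.
  val nnz = rows.iterator.map(_.length).sum
  val rowIdx = new Array[Int](nnz)
  val colIdx = new Array[Int](nnz)
  val values = new Array[Double](nnz)
  var k = 0
  rows.zipWithIndex.foreach { case (row, i) =>
    row.foreach { case (j, v) =>
      rowIdx(k) = i
      colIdx(k) = j
      values(k) = v
      k += 1
    }
  }
  // Arrays are already the final representation; no conversion copy.
  (rowIdx, colIdx, values)
}
```

The pre-pass costs one extra scan over the row lengths, but removes both the amortized resize copies inside ArrayBuffer and the terminal toArray copy, which is where the GC win comes from.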
Benchmark
Here are some test results from a local environment:
data set: 200 million (rows) x 1 billion (features)
Followups
Add a new unit test for this change?
Unit Test
All tests passed.