zeyuanxy / spark-hbase-bulk-loading

A single Scala file providing bulk-loading for HBase in Spark.
Apache License 2.0

Code is not working when data size is large #1

Open souvik82 opened 7 years ago

souvik82 commented 7 years ago

I have a 4-node cluster: the master has 56 GB of RAM and the data nodes have 32 GB each. With a small data set of around 200 MB it works fine, but with 10 GB of data the job hangs.

Sorting the data is also painfully slow: for 10 GB it takes around 2.9 minutes. Below is my spark-submit script:

spark-submit \
  --jars /usr/lib/hbase-client-1.2.0-IBM-7.jar,/usr/lib/hbase-hadoop-compat-1.2.0-IBM-7.jar,/usr/lib/htrace-core-3.1.0-incubating.jar,/usr/lib/hbase-common-1.2.0-IBM-7.jar,/usr/lib/hbase-hadoop2-compat-1.2.0-IBM-7.jar,/usr/lib/hbase-protocol.jar,/usr/lib/hbase-server-1.2.0-IBM-7.jar,/usr/lib/metrics-core-2.2.0.jar,/usr/lib/hbase-annotations-1.2.0-IBM-7.jar \
  --class testHbaseRDDUtil \
  souvik-0.0.1-SNAPSHOT.jar --driver-memory 20G --executor-memory 4G --num-executors 32
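
Note that spark-submit treats everything after the application jar as arguments to the application's main class, so the --driver-memory, --executor-memory, and --num-executors flags above are handed to testHbaseRDDUtil rather than to Spark. Moving them before the jar (jar list abbreviated here) would give:

spark-submit \
  --jars <same HBase jars as above> \
  --class testHbaseRDDUtil \
  --driver-memory 20G \
  --executor-memory 4G \
  --num-executors 32 \
  souvik-0.0.1-SNAPSHOT.jar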

zeyuanxy commented 7 years ago

Hello, the sort is accompanied by a repartition, which means a lot of data is transported between machines, and that is definitely a performance killer. Can you share the detailed metrics (or status) of your Spark job?
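
For intuition, here is a minimal sketch (not this library's actual code) of why bulk loading forces that shuffle: HFiles must contain cells in total row-key order, so every record has to cross the network to the partition of the region that owns its key and be sorted there before writing.

import org.apache.hadoop.hbase.KeyValue
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

object BulkLoadShuffleSketch {
  // Hypothetical partitioner: routes each row key to the region whose
  // start key covers it (linear scan kept for brevity).
  class RegionPartitioner(startKeys: Array[Array[Byte]]) extends Partitioner {
    def numPartitions: Int = startKeys.length
    def getPartition(key: Any): Int = {
      val k = key.asInstanceOf[Array[Byte]]
      math.max(startKeys.lastIndexWhere(s => Bytes.compareTo(s, k) <= 0), 0)
    }
  }

  // HBase's byte-wise comparator gives the total order HFiles require.
  implicit val byteOrdering: Ordering[Array[Byte]] =
    Ordering.comparatorToOrdering(Bytes.BYTES_COMPARATOR)

  def sortForBulkLoad(cells: RDD[(Array[Byte], KeyValue)],
                      startKeys: Array[Array[Byte]]): RDD[(Array[Byte], KeyValue)] =
    // One full shuffle: every cell moves to its region's partition, then is
    // sorted within that partition -- this is the data transport cost above.
    cells.repartitionAndSortWithinPartitions(new RegionPartitioner(startKeys))
}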

zeyuanxy commented 7 years ago

Hi @souvik82 , I've added a new interface, toHBaseBulkWithFamilies, that lets you specify the column families up front rather than iterating over all of them, which should greatly improve performance when the data size is huge. Can you try it? Thanks~
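
For reference, a call site might look like the sketch below; the exact signature of toHBaseBulkWithFamilies is an assumption here (check the repo's source for the real one), and the point is only that the column families are declared explicitly instead of being discovered per row:

import org.apache.hadoop.hbase.util.Bytes

// Hypothetical usage; the table name, parameter order, and the shape of
// rdd are illustrative assumptions, not the repo's exact API.
val families = Seq(Bytes.toBytes("cf1"), Bytes.toBytes("cf2"))
rdd.toHBaseBulkWithFamilies("my_table", families)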

ghost commented 7 years ago

How do I set the parameter numFilesPerRegionPerFamily? I only know that I have a dozen or so GB of data to save into HBase; how should this number be tuned dynamically?
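
One plausible heuristic, offered as an assumption rather than guidance from this repo: aim for each HFile to land near the region file-size target, and derive the count from the total data volume, the region count, and the family count.

// Back-of-envelope heuristic; every figure below is an illustrative assumption.
val totalBytes  = 15L * 1024 * 1024 * 1024  // ~15 GB of input data
val numRegions  = 20                        // regions in the target table
val numFamilies = 2                         // column families being written
val targetHFile = 1L * 1024 * 1024 * 1024   // aim for ~1 GB per HFile

val bytesPerRegionPerFamily = totalBytes / (numRegions.toLong * numFamilies)
val numFilesPerRegionPerFamily =
  math.max(1L, math.ceil(bytesPerRegionPerFamily.toDouble / targetHFile).toLong)
// 15 GB / (20 regions * 2 families) ~= 384 MB per region per family -> 1 file each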