Open souvik82 opened 7 years ago
Hello, sorting is accompanied by a repartition, which means a lot of data is shuffled between machines, and that is definitely a performance killer. Can you share the detailed metrics (or status) of your Spark job?
Hi @souvik82 , I've added a new interface, toHBaseBulkWithFamilies, that lets you specify the families instead of iterating through all of them, which should greatly improve performance when the data size is huge. Can you try it? Thanks~
How do I set the numFilesPerRegionPerFamily parameter? I have a dozen or so GB of data that needs to be saved to HBase. I only know the rough data size; how should this number be tuned dynamically?
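One way to reason about this parameter (a sketch only; the heuristic, function name, and all numbers below are my assumptions, not documented guidance from this connector): bulk loads write one or more HFiles per region per column family, so you can estimate the value from total data size, region count, family count, and the HFile size you want to end up with.

```python
import math

def num_files_per_region_per_family(total_gb, regions, families, target_hfile_gb=1.0):
    # Hypothetical sizing helper: estimate how many bytes land in each
    # (region, family) bucket, then divide by the desired HFile size.
    per_bucket_gb = total_gb / (regions * families)
    # At least one file per bucket, rounded up to keep files under target size.
    return max(1, math.ceil(per_bucket_gb / target_hfile_gb))

# e.g. ~15 GB spread over 20 regions with 1 family and ~1 GB HFiles
print(num_files_per_region_per_family(15, 20, 1))  # -> 1
# fewer regions means more data per region, hence more files per bucket
print(num_files_per_region_per_family(15, 4, 1))   # -> 4
```

Fewer, larger HFiles generally mean less compaction work for HBase afterwards, so keeping the result small is usually preferable.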
I have a 4-node cluster: the master has 56 GB of RAM and the data nodes have 32 GB each. With a small data set (around 200 MB) it works fine. With 10 GB of data, it hangs.
The sorting of the data is horrible: it takes far too long (around 2.9 minutes for 10 GB). Below is my spark-submit script:
```
spark-submit --jars /usr/lib/hbase-client-1.2.0-IBM-7.jar,/usr/lib/hbase-hadoop-compat-1.2.0-IBM-7.jar,/usr/lib/htrace-core-3.1.0-incubating.jar,/usr/lib/hbase-common-1.2.0-IBM-7.jar,/usr/lib/hbase-hadoop2-compat-1.2.0-IBM-7.jar,/usr/lib/hbase-protocol.jar,/usr/lib/hbase-server-1.2.0-IBM-7.jar,/usr/lib/metrics-core-2.2.0.jar,/usr/lib/hbase-annotations-1.2.0-IBM-7.jar --class testHbaseRDDUtil souvik-0.0.1-SNAPSHOT.jar --driver-memory 20G --executor-memory 4G --num-executors 32
```
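One likely problem with the script above: spark-submit treats everything after the application JAR as arguments to the application's main method, not as Spark options. As written, `--driver-memory`, `--executor-memory`, and `--num-executors` are being passed to `testHbaseRDDUtil` and silently ignored, so the job runs with default memory settings. A corrected invocation (same JARs and values as above, just reordered) would look like:

```shell
# Spark options must come BEFORE the application JAR;
# anything after the JAR goes to the application's main().
spark-submit \
  --class testHbaseRDDUtil \
  --driver-memory 20G \
  --executor-memory 4G \
  --num-executors 32 \
  --jars /usr/lib/hbase-client-1.2.0-IBM-7.jar,/usr/lib/hbase-hadoop-compat-1.2.0-IBM-7.jar,/usr/lib/htrace-core-3.1.0-incubating.jar,/usr/lib/hbase-common-1.2.0-IBM-7.jar,/usr/lib/hbase-hadoop2-compat-1.2.0-IBM-7.jar,/usr/lib/hbase-protocol.jar,/usr/lib/hbase-server-1.2.0-IBM-7.jar,/usr/lib/metrics-core-2.2.0.jar,/usr/lib/hbase-annotations-1.2.0-IBM-7.jar \
  souvik-0.0.1-SNAPSHOT.jar
```

If the 10 GB run was hanging with default executor memory, fixing the option ordering alone may change the behavior significantly.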