A couple changes to improve the Spark benchmark

granturing commented 8 years ago

Cody actually opened an issue, #4, for these suggestions already, but I thought I'd still go ahead and submit a PR.

Changing groupByKey to a reduceByKey helps reduce shuffle output as well as task memory usage for stage 1. Also, for the final output I simply use foreachRDD/foreachPartition rather than mapPartitions/count which ends up reducing from 3 to 2 stages.

Some of the comments mention using repartition to increase or decrease parallelization. As Cody mentioned, stage 1 parallelization is going to be limited by the number of Kafka partitions. Setting the PARTITIONS environment variable to a higher value definitely improved execution time for me. Using a Spark repartition would come at the cost of increased latency vs just increasing Kafka partitions.

revans2 commented 7 years ago

@granturing sorry this has taken so long. We are trying to pay more attention to this project.

That you so much for the contribution. I am +1 for merging this in, the only thing that seems to be missing is a single CLA. If you are still interested in contributing if you could follow the instructions you received earlier on submitting the CLA then I would be happy to merge this in.

yahoocla commented 7 years ago

CLA is valid!

revans2 commented 7 years ago

oops yahoocla just said that it is valid so +1. Thanks for the contribution.

yahoo / streaming-benchmarks

A couple changes to improve the Spark benchmark #6