Closed granturing closed 7 years ago
@granturing sorry this has taken so long. We are trying to pay more attention to this project.
That you so much for the contribution. I am +1 for merging this in, the only thing that seems to be missing is a single CLA. If you are still interested in contributing if you could follow the instructions you received earlier on submitting the CLA then I would be happy to merge this in.
CLA is valid!
oops yahoocla just said that it is valid so +1. Thanks for the contribution.
Cody actually opened an issue, #4, for these suggestions already, but I thought I'd still go ahead and submit a PR.
Changing groupByKey to a reduceByKey helps reduce shuffle output as well as task memory usage for stage 1. Also, for the final output I simply use foreachRDD/foreachPartition rather than mapPartitions/count which ends up reducing from 3 to 2 stages.
Some of the comments mention using repartition to increase or decrease parallelization. As Cody mentioned, stage 1 parallelization is going to be limited by the number of Kafka partitions. Setting the PARTITIONS environment variable to a higher value definitely improved execution time for me. Using a Spark repartition would come at the cost of increased latency vs just increasing Kafka partitions.