My apologies for not opening an issue first; this PR arose out of some experimentation and I thought it would be easier to describe the issue with the code to refer to.
This PR does two things:
Configure the global thread pool (.build_global()) rather than a local one (.build()) to ensure that subsequent calls to Rayon parallelization routines use the correct number of threads. A pool created with .build() only applies to work that is explicitly run inside it, so without this change the --threads argument had essentially no effect for me (Rayon would pick its own default number of threads, ignoring the specified value).
Swap out the use of mutexes for atomic integers. After configuring the global thread pool I noticed there was almost no parallelization, due to the frequent locking and unlocking of the counters for the number of reads processed and output. Using atomic integers removes this bottleneck.
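To illustrate the second point, here is a minimal sketch of the two counting strategies. It is not the code from this PR: it uses std::thread instead of Rayon so that it compiles without extra dependencies, and passes_filter, count_with_atomics, and count_with_mutexes are hypothetical names made up for the example. Both versions return the same counts; the difference is that the mutex version takes a lock for every single read, while the atomic version does a lock-free increment.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;
use std::thread;

/// Hypothetical stand-in for the real filter: keep reads of at least `min_len`.
fn passes_filter(read_len: usize, min_len: usize) -> bool {
    read_len >= min_len
}

/// Count (processed, kept) reads using lock-free atomic counters.
fn count_with_atomics(read_lens: &[usize], min_len: usize, n_threads: usize) -> (usize, usize) {
    let processed = AtomicUsize::new(0);
    let kept = AtomicUsize::new(0);
    // Split the input into roughly one chunk per thread (at least 1 item per chunk).
    let n = n_threads.max(1);
    let chunk_size = ((read_lens.len() + n - 1) / n).max(1);
    thread::scope(|s| {
        for chunk in read_lens.chunks(chunk_size) {
            let processed = &processed;
            let kept = &kept;
            s.spawn(move || {
                for &len in chunk {
                    // Relaxed is enough here: we only need the final totals.
                    processed.fetch_add(1, Ordering::Relaxed);
                    if passes_filter(len, min_len) {
                        kept.fetch_add(1, Ordering::Relaxed);
                    }
                }
            });
        }
    });
    // All scoped threads have been joined at this point.
    (processed.load(Ordering::Relaxed), kept.load(Ordering::Relaxed))
}

/// Same counts, but every increment takes and releases a lock.
fn count_with_mutexes(read_lens: &[usize], min_len: usize, n_threads: usize) -> (usize, usize) {
    let processed = Mutex::new(0usize);
    let kept = Mutex::new(0usize);
    let n = n_threads.max(1);
    let chunk_size = ((read_lens.len() + n - 1) / n).max(1);
    thread::scope(|s| {
        for chunk in read_lens.chunks(chunk_size) {
            let processed = &processed;
            let kept = &kept;
            s.spawn(move || {
                for &len in chunk {
                    *processed.lock().unwrap() += 1;
                    if passes_filter(len, min_len) {
                        *kept.lock().unwrap() += 1;
                    }
                }
            });
        }
    });
    (processed.into_inner().unwrap(), kept.into_inner().unwrap())
}

fn main() {
    // Synthetic read lengths between 100 and 399.
    let read_lens: Vec<usize> = (0..1000).map(|i| 100 + (i % 300)).collect();
    let atomic = count_with_atomics(&read_lens, 200, 4);
    let mutex = count_with_mutexes(&read_lens, 200, 4);
    assert_eq!(atomic, mutex);
    println!("processed {} reads, kept {}", atomic.0, atomic.1);
}
```

How much the lock costs in practice depends on how much work is done per read relative to the counter update, so the timings in this PR are the actual evidence; the sketch only shows the shape of the change.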
This is somewhat of an experimental approach, but I see the following timings (macOS, M1) when filtering a 400 MB FASTQ file with 1, 2, 4, and 8 threads, in that order (time target/debug/chopper -l 200 -q 7 --threads $N < in.fastq > out.fastq):
Code in master: 8.97s, 8.96s, 8.92s, 8.94s
With .build_global() and using mutexes: 9.13s, 9.52s, 9.17s, 8.90s
With .build_global() and atomics (this PR): 9.32s, 5.01s, 3.53s, 3.61s