Closed JMencius closed 7 months ago
Hi,
Thank you for contributing :-)
I am concerned that this makes things slower for users that don't need the GC filter. Maybe it should not be calculated if people are not filtering on it? Maybe you could benchmark if your changes made things slower.
And this is the key part of the failed test:
Best, Wouter
Yeah, the performance is an issue. I will make it an optional parameter then.
Sorry for the error, I forgot to change it to f64 (`1.0` and `0.0`). I will change it accordingly in the next commit.
I updated the code to check whether `--mingc` is not 0.0 or `--maxgc` is not 1.0, to see if the user set the GC content filter. If the GC content filter is set, the GC content is calculated; otherwise it is left at 0.5, which always passes the filter below.
I also updated the `cal_gc` function to use `.len()` rather than counting the length base by base.
This should not affect performance significantly if no GC content filter is set.
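The two changes described above can be sketched roughly as follows. This is a hypothetical sketch, not chopper's actual code; the function names and exact gating logic are assumptions:

```rust
// Hypothetical sketch of the changes described above (not chopper's
// actual implementation).

// GC fraction of a sequence, dividing by seq.len() instead of counting
// the sequence length base by base. Assumes a non-empty sequence.
fn cal_gc(seq: &[u8]) -> f64 {
    let gc = seq
        .iter()
        .filter(|&&b| matches!(b, b'G' | b'g' | b'C' | b'c'))
        .count();
    gc as f64 / seq.len() as f64
}

// Only compute GC content when the user changed the defaults
// (--mingc 0.0 / --maxgc 1.0); otherwise return a neutral 0.5 that
// always passes a [min_gc, max_gc] filter.
fn gc_for_filter(seq: &[u8], min_gc: f64, max_gc: f64) -> f64 {
    if min_gc != 0.0 || max_gc != 1.0 {
        cal_gc(seq)
    } else {
        0.5
    }
}
```

With this gating, reads only pay for the GC scan when a GC bound was actually requested.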
I have made some changes to how you implemented this. Would you agree that this is equivalent to your code? Before I merge this, I want to ensure that the way this is implemented does not majorly impact performance (by benchmarking the current release against this).
Yeah, I will run some benchmarks using huge FASTQ files.
Hi @wdecoster I ran some tests on a 44G FASTQ file from *Drosophila melanogaster* ONT sequencing and found some interesting results.

| Data | Size | Command | Chopper 0.8.0 run time | GC-filter Chopper run time |
|---|---|---|---|---|
| DM.fastq | 44G | `cat DM.fastq \| ./chopper -q 10 -l 500 > DM_q10_l500.fastq` | 309.832 s | 310.268 s |
| DM.fastq | 44G | `./chopper -q 10 -l 500 --input DM.fastq > DM_q10_l500.fastq` | 282.506 s | 281.578 s |
| DM.fastq | 44G | `./chopper -q 10 -l 500 --input DM.fastq --mingc 0.5 > DM_q10_l500_gchalf.fastq` | - | 382.220 s |
| DM.fastq.gz | 21G | `gunzip -c DM.fastq.gz \| ./chopper -q 10 -l 500 > DM_q10_l500.fastq` | 652.202 s | 675.225 s |
| DM.fastq.gz | 21G | `./chopper -q 10 -l 500 --input DM.fastq.gz > DM_q10_l500.fastq` | 3077.761 s | 3060.424 s |
| DM.fastq.gz | 21G | `./chopper -q 10 -l 500 --input DM.fastq.gz --mingc 0.5 > DM_q10_l500_gchalf.fastq` | - | 2931.972 s |
I found that the GC filter does not significantly affect the run time. `--input` was also benchmarked: for uncompressed FASTQ, `--input` gives slightly better performance, but for the `.fastq.gz` file it is far slower. I think the GC filter part is good, but there is something to optimize when reading a compressed file directly.
Oh yes indeed, that is interesting. I agree that the GC calculation doesn't slow things down when it is not needed, so that is great. Out of curiosity, did you also benchmark it when you did specify a min or max gc?
It seems gunzip does a far better job than the decompression done by Rust. I wonder if it is because you are then running the decompression in another process, piping the output, but I don't know how things get implemented in flate2. But things like this are also precisely the reason why chopper initially only read from stdin. Unix tools are intended to do one thing well, and just one thing, and piping tools into the next tool is a great approach. I will add a note regarding your findings to the README.
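The stdin-first design described here can be sketched with only the standard library: the program consumes whatever line stream an upstream process pipes in, so decompression lives in a separate process (`gunzip -c`, `pigz -d -c`) on its own core(s). A minimal illustrative helper, not chopper's actual reader:

```rust
// Minimal sketch of a stdin-driven FASTQ consumer (not chopper's actual
// code). Decompression is left to an upstream process, e.g.
//   pigz -d -c reads.fastq.gz | this_tool
// so this function only ever sees plain text lines.
fn count_fastq_records<I: Iterator<Item = String>>(lines: I) -> usize {
    lines
        .enumerate()
        // A FASTQ record spans 4 lines; the header line starts with '@'.
        .filter(|(i, line)| i % 4 == 0 && line.starts_with('@'))
        .count()
}
```

In a real binary this would be driven by `std::io::stdin().lock().lines()`; the point is that the tool never needs to know whether its input was ever compressed.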
Yeah, I added the test for `--mingc` to the benchmark above. I think the performance drop is not too bad, considering only 4 cores were used.
> I wonder if it is because you are then running the decompression in another process, piping the output, but I don't know how things get implemented in flate2
Maybe I will tackle the `--input` performance problem for compressed files some day, to make use of multi-core decompression. I suppose one can run `pigz -d -c X.fastq.gz | chopper -q 10 -l 500 > X_q10_l500.fastq`, right?
Hi @wdecoster , recently I added parameters `--mingc` and `--maxgc` for a GC content filter, for my own project's usage. I also added a `testGC.fastq` for testing under `/chopper/test-data`. It is very useful for dealing with high-GC bacterial ONT sequencing. Hope you like my modifications.
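As a rough sketch, the filter introduced by these flags presumably keeps a read when its GC fraction lies inside the closed interval the two bounds define. The helper below is hypothetical, not the PR's actual code:

```rust
// Hypothetical predicate for the --mingc/--maxgc filter added in this PR:
// a read passes when its GC fraction lies within [min_gc, max_gc].
// The defaults 0.0 and 1.0 keep every read.
fn passes_gc_filter(gc: f64, min_gc: f64, max_gc: f64) -> bool {
    gc >= min_gc && gc <= max_gc
}
```

For high-GC bacterial data, setting `--mingc` close to the expected genomic GC fraction would then discard low-GC contaminant reads.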