Add GC content parameter --mingc --maxgc

JMencius commented 7 months ago

Hi @wdecoster , Recently, I added parameters for GC content filter --mingc and --maxgc for my own project usage. I also added a testGC.fastq for testing under /chopper/test-data. It is very useful to deal with high-GC bacteria ONT sequencing. Hope u like my modifications.

wdecoster commented 7 months ago

Hi,

Thank you for contributing :-)

I am concerned that this makes things slower for users that don't need the GC filter. Maybe it should not be calculated if people are not filtering on it? Maybe you could benchmark if your changes made things slower.

And this is the key part of the failed test:

Best, Wouter

JMencius commented 7 months ago

Yeah, the performance is an issue. I will make it an optional parameter then. Sorry for the error, I forget to change it to f64 as 1.0 and 0.0. I will change acccordingly in the next commit.

JMencius commented 7 months ago

I updated the code to check if --mingc is not 0.0 or --maxgc is not 1.0 to see if user set the gc content filter. If the gc content filter is set, then calculate the gc content, otherwise just leave the gc content as 0.5 to pass the filter below. I also updated the cal_gc function to use .len() rather than count the lenght one by one. This should not affect the performace significantly if no gc content filter is set.

wdecoster commented 7 months ago

I have made some changes to how you implemented this. Would you agree that this is equivalent to your code? Before I merge this, I want to ensure that the way this is implemented does not majorly impact performance (by benchmarking the current release against this).

JMencius commented 7 months ago

Yeah, I will run some benchmark using huge fastq files.

JMencius commented 7 months ago

Hi @wdecoster I ran some tests on a 44G FASTQ file from Drosophila melanogaster ONT sequencing and found some interesting results.	Data	Size	command	Chopper 0.8.0 run time
DM.fastq	44G	cat DM.fastq \| ./chopper -q 10 -l 500 > DM_q10_l500.fastq	309.832 s	310.268 s
DM.fastq	44G	./chopper -q 10 -l 500 --input DM.fastq > DM_q10_l500.fastq	282.506 s	281.578 s
DM.fastq	44G	./chopper -q 10 -l 500 --input DM.fastq --mingc 0.5 > DM_q10_l500_gchalf.fastq	-	382.220 s
DM.fastq.gz	21G	gunzip -c DM.fastq.gz \| ./chopper -q 10 -l 500 > DM_q10_l500.fastq	652.202 s	675.225 s
DM.fastq.gz	21G	./chopper -q 10 -l 500 --input DM.fastq.gz > DM_q10_l500.fastq	3077.761 s	3060.424 s
DM.fastq.gz	21G	./chopper -q 10 -l 500 --input DM.fastq.gz --mingc 0.5 > DM_q10_l500_gchalf.fastq	-	2931.972 s

I found that the GC filter does not significantly affect the run time. The --input was also benchmarked, and for non-compressed fastq, --input has sightly better performance but for the .fastq.gz file is way below, I think the GC filter part is good but there is something to optimize for the straightforwad compressed file.

wdecoster commented 7 months ago

Oh yes indeed, that is interesting. I agree that the GC calculation doesn't slow things down when it is not needed, so that is great. Out of curiosity, did you also benchmark it when you did specify a min or max gc?

It seems gunzip does a far better job than the decompression done by Rust. I wonder if it is because you are then running the decompression in another process, piping the output, but I don't know how things get implemented in flate2. But things like this are also precisely the reason why I initially chopper only read from stdin. Unix tools are intended to do one thing well, and just one thing, and piping tools into the next tool is a great approach. I will add a note regarding your findings to the README.

JMencius commented 7 months ago

Yeah I added the test for --mingc implement above, I think the performace drop is not great for using only 4 cores.

I wonder if it is because you are then running the decompression in another process, piping the output, but I don't know how things get implemented in flate2

Maybe I will deal with this problem the --input performance problem for compressed file some day, to make use of multi-core decompressing. I suppose one can run pigz -d -c X.fastq.gz | chopper -q 10 -l 500 > X_q10_l500.fastq, right?

wdecoster / chopper

Add GC content parameter --mingc --maxgc #30