wdecoster / chopper

MIT License
135 stars 11 forks source link

minimap2 preset for `--contam` #35

Closed FatYuanBao closed 1 month ago

FatYuanBao commented 3 months ago

Hi developers, I think there is a small problem with --contam. In the main.rs line 235-242, minimap2 was deployed for alignment, but with the LrHq preset.

fn setup_contamination_filter(contam_fasta: &str, threads: &usize) -> Aligner {
    Aligner::builder()
        .preset(Preset::LrHq)
        .with_index_threads(*threads)
        .with_cigar()
        .with_index(contam_fasta, None)
        .expect("Unable to build index")
}

The LrHq preset is specifically for ONT R10 Q20 data, which is not every case for the users using chopper. Although R9 flowcells were obsolete, the majority of ONT data was sequenced with R9 flowcells. I suggest let the user to input R9 or R10 (maybe add a parameter --flowcell [R9]/[R10]) to determine to use the preset LrHq or map-ont.

JWDebler commented 3 months ago

Is there some other setting to contamination detaction? I ran a dataset with known DCS contamination through chopper and then through minimap directly. Chopper for my particular dataset let almost 1000 reads through that still map to DCS (over the whole length), while minimap2 directly with the lr:hq preset only let a handful of short reads through. At this point it is still letting too many DCS reads pass for my comfort. My commands respectively: chopper -i Ar15CUR005-high-sugar.simplex.fastq.gz --contam DCS.fasta --threads 16 > chopper.fastq

minimap2 -d dcs.mmi DCS.fasta && minimap2 -t 16 -ax lr:hq dcs.mmi Ar15CUR005-high-sugar.simplex.fastq.gz | samtools view -@ 16 -b -f4 - | samtools fastq -@ 16 - > minimap.fastq Chopper cleaned reads mapped to DCS: image Minimap2 + samtools: image

FatYuanBao commented 3 months ago

Hi @JWDebler, I wonder what flowcell was used to sequenced Ar15CUR005-high-sugar.simplex.fastq.gz? Is it R9 or R10? And also for the users using PacBio CCS, the mode lr:hq is not good.

JWDebler commented 3 months ago

R10

wdecoster commented 3 months ago

Hi everyone,

Thanks for your feedback.

@FatYuanBao: yes, your concern is justified. I however haven't evaluated how the alignment is for multiple chemistry versions and sequencers, with this preset. It is not because it is not optimal that it doesn't work at all... but further testing would be good. Regardless of what the companies will tell you, the differences between R10 ONT and PacBio CCS are not huge, anymore. Note that the lr:hq preset in chopper is only there since v0.8.0

@JWDebler: chopper is eliminating reads that are at least 90% aligned to the contaminating sequence (DCS then). Could that explain what you observe? I don't want to eliminate sequences that have only a short alignment between query and contaminant. I think false positives are worse than false negatives here, but that will depend on your application...

In general, while circumstances may warrant different parameters, I want to avoid to create a tool with too many options :-)

Best, Wouter

JWDebler commented 3 months ago

Yep, that probably explains it. I will stick with Minimap for removal then because I have seen DCS reads ending up as parts of contigs during assembly or as standalone contigs. Maybe you could expose this as another option, but it's not that urgent. I just pipe my decontaminated reads into chopper for quality and length filtering.

wdecoster commented 3 months ago

And you mean exposing the alignment percentage as an option, or the alignment preset?

FatYuanBao commented 3 months ago

It is not because it is not optimal that it doesn't work at all...

Yeah, that is really the point, but I still think using preset map-ont is better at this time for

  1. lr:hq is still in development, which may change later.
  2. map-ont works good on R9 nanopore data and many paper even use map-ont for R10 nanopore data.
JWDebler commented 3 months ago

In my case the percentage, others might be interested in the preset. But I agree that one of the strengths of chopper is its limited set of features and you can always just pipe your personal preferred Minimap results straight into it.

FatYuanBao commented 1 month ago

I will just close this issue first since --contam was not the core of chopper.