wdecoster / chopper

MIT License
135 stars 11 forks source link

missing ".with_cigar()" in Aligner? #19

Closed wchengt closed 10 months ago

wchengt commented 10 months ago

Hello! Thank you for creating chopper for us. However, I noticed when I was trying to remove DCS reads from my fastq files that a good portion of contaminating reads still remain. This is an example of one read blasted against the DCS sequence. contam_read

(query) bad read: 3,800 bp (90% =3,420bp) (target) DCS: 3,560 bp

Chopper left these reads, so I decided to manually run minimap2 -ax map-ont DCS.fasta read.fq to see the PAF results. My "match_len" was 3,510bp. Please correct me if I'm misinterpreting the filter function, but I assume because 3,510bp > 3,420bp it should be classified as a contaminate.

Alternatively if i run minimap2 -x map-ont DCS.fasta read.fq my "match_len" was 3,268bp. Because it is not greater than 3,420bp the read would be retained. Could chopper be inaccurately reporting the lengths because the Aligner setup in lines: 178-184 is missing ".with_cigar()"? https://github.com/lh3/minimap2/issues/158

fn setup_contamination_filter(contam_fasta: &str) -> Aligner {
    Aligner::builder()
        .with_threads(8)
        .map_ont()
        .with_index(contam_fasta, None)
        .expect("Unable to build index")
}
wdecoster commented 10 months ago

Hi,

Thank you for reporting this! That is an interesting observation. As we don't use DNA CS, could you please test the attached binary to see if your proposed changes solve the problem?

Thanks, chopper-musl.zip

Wouter

wchengt commented 10 months ago

.with_cigar() fixed the problem! Normally I don't have to worry about DCS reads because I filter 5kb+ but the throughout of my current dataset required me to lower my threshold. Host read removal also looks better.

Thank you for the update 😃

wdecoster commented 10 months ago

Thanks for confirming, I will push out a new release now.

nmflack commented 6 months ago

Hello, I'm still experiencing this issue with version 0.7.0 (bioconda install on M1 Mac). minimap2 -aLyx map-ont aligns ~200 reads to the DNA CS sequence; chopper -c DCS.fa does not remove any of them. Other filtering flags (quality, length) are working as expected. Thanks for taking a look!

image image
wdecoster commented 6 months ago

Is this data you can share with me?

nmflack commented 6 months ago

Here is the the FASTQ and reference FASTA for DCS: barcode12.all.fastq.gz DCS.fasta.txt

wdecoster commented 6 months ago

Hi,

The requirement for the contamination filter is that at least 90% of the fragment aligns with the contaminant to avoid removing short aspecific matches that are in the read by chance. It appears that in your dataset, about 40-60% of the fragment aligns with the DNA-CS. Can you think of a reason for this? Are there long adapters or something like that?