wdecoster / chopper

MIT License
135 stars 11 forks

Suggestion #37

Open Potatoconomy opened 3 years ago

Potatoconomy commented 3 years ago

Hey! Thanks for the great program.

There are two things that would, in my eyes, really round out the utility of this tool.

1) Removal of polyA/T tails. Only a subset of my reads still contains an A/T tail, so headclipping to remove this bias also clips the reads that have already had this section removed.

2) Nucleotide tail clipping on a subset of reads. Right now, it tailclips all reads; however, my FastQC report shows that I should only be tailclipping the longest reads in my fastq file.

Once again, thanks for the program!

wdecoster commented 3 years ago

Hi,

Thanks for the suggestions! I'll give them some thought, but have a question for each:

1) Do you suggest removing only 'exact' polyA/T tails (with only AAAAAAA or only TTTTTTTTTTTTTT), or (I assume the latter) also allowing some noise in those stretches?

2) How do you think this should be implemented? For example, an option like --clip-when-length 10000 that lets the user specify the read length at which the clipping rules apply?

Cheers, Wouter

Potatoconomy commented 3 years ago

Hey,

I am working with nanopore reads, which have an error rate of around 15% per nucleotide. For this reason, the noise would have to be accounted for, likely with a sliding window technique. Prinseq, which I use, removes only the exact polyA/T tails, so this still leaves me with a +10% T bias at the beginning of my reads and a slight A bias at the end.
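To make the sliding window idea concrete, here is a rough Python sketch. It is not how chopper, NanoFilt, or Prinseq work; the function names, the window size of 10, and the tolerance of 2 non-matching bases per window are all assumptions chosen for illustration. It assumes uppercase sequences, with the polyA tail at the 3' end and the polyT stretch at the 5' end.

```python
def poly_tail_start(seq, base="A", window=10, max_mismatches=2):
    """Return the index where a noisy 3' run of `base` begins.

    Hypothetical helper: returns len(seq) when no tail is detected.
    """
    n = len(seq)
    cut = n
    # Slide a fixed-size window from the 3' end towards the 5' end and
    # stop at the first window with too many non-`base` characters.
    for start in range(n - window, -1, -1):
        mismatches = window - seq.count(base, start, start + window)
        if mismatches > max_mismatches:
            break
        cut = start
    # The last accepted window may overhang into real sequence, so move
    # the cut forward until it sits on the first `base` of the tail.
    while cut < n and seq[cut] != base:
        cut += 1
    return cut


def trim_poly_a_tail(seq, **kwargs):
    """Remove a noisy polyA tail from the 3' end of a read."""
    return seq[:poly_tail_start(seq, base="A", **kwargs)]


def trim_poly_t_head(seq, **kwargs):
    """Remove a noisy polyT stretch from the 5' end by scanning the reversed read."""
    cut = poly_tail_start(seq[::-1], base="T", **kwargs)
    return seq[len(seq) - cut:]


if __name__ == "__main__":
    print(trim_poly_a_tail("GATTACAGGC" + "AAAAACAAAAGAAAAA"))
    print(trim_poly_t_head("TTTTTCTTTTT" + "GACCGTAGCA"))
```

For FASTQ records the quality string would need to be cut at the same index, and a genuine A at the very end of the transcript is indistinguishable from the tail, so it gets trimmed as well.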

Right now, my pipeline is to do some trimming with NanoFilt and then follow that up with the A/T trimming with Prinseq. This has given me the least nucleotide bias so far, although there is still some present.

For the 2nd suggestion, I had actually misinterpreted my FastQC report and forgot that there were fewer reads with longer lengths, hence increasing the variance of my data in that region.

Thanks, Patrick

wdecoster commented 3 years ago

So that leaves us with only suggestion 1? Okay, I'll think about how best to implement this.