z0on / tag-based_RNAseq

Cost-efficient genome-wide gene expression profiling

tagseq processing not recognizing fastq headers #3

Closed laurahspencer closed 4 years ago

laurahspencer commented 4 years ago

I'm following your tagSeq_processing_README.txt protocol to trim and filter reads, generated from QuantSeq libraries run on an Illumina NovaSeq platform this month. The output indicates that a very large portion of my reads do not have headers:

image

Upon inspection, the fastq files don't appear to lack headers, but I'm wondering if the tagseq_clipper.pl script is looking for a different header format? My headers are in the following format:

image

Here are abbreviated versions of an untrimmed file, and the trimmed file showing reads that passed the tagseq_clipper.pl script: example_files.zip

Admittedly, I'm unfamiliar with Perl scripts, so any help would be great.

z0on commented 4 years ago

Hi Laura - easily solvable! Can you please poke me tomorrow if I forget to reply? In short, the "header" is not the read title line of the fastq record, but the leading 5' portion of the read sequence, which is used for deduplication. Do you have those in your QuantSeq reads?
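To make the distinction concrete, here is a minimal Python sketch of header-based clipping and deduplication. It is not the logic of tagseq_clipper.pl: the header pattern (four degenerate bases followed by a run of G's), the function names, and the 20-base comparison window are all assumptions chosen for illustration.

```python
import re
from collections import Counter

# Hypothetical header: 4 degenerate bases then 3+ G's at the 5' end of the read.
# The pattern actually used by tagseq_clipper.pl may differ.
HEADER_RE = re.compile(r"^([ACGT]{4})(G{3,})")

def parse_fastq(lines):
    """Yield (title, sequence, quality) tuples from an iterable of FASTQ lines."""
    it = iter(lines)
    for title in it:
        if not title.strip():
            continue  # skip stray blank lines
        seq = next(it).strip()
        next(it)  # '+' separator line
        qual = next(it).strip()
        yield title.strip(), seq, qual

def clip_and_dedup(records, window=20):
    """Drop reads lacking the header, clip it off, and keep one read per
    (degenerate tag, first `window` bases of the remaining insert)."""
    seen = set()
    stats = Counter()
    kept = []
    for title, seq, qual in records:
        m = HEADER_RE.match(seq)
        if m is None:
            stats["no_header"] += 1  # title line is fine, 5' header is absent
            continue
        tag = m.group(1)
        insert = seq[m.end():]
        key = (tag, insert[:window])
        if key in seen:
            stats["duplicate"] += 1
            continue
        seen.add(key)
        stats["kept"] += 1
        kept.append((title, insert, qual[m.end():]))
    return kept, stats

if __name__ == "__main__":
    demo = [
        "@r1", "ACGTGGGTTTTCCCCAAAA", "+", "I" * 19,   # header present
        "@r2", "ACGTGGGTTTTCCCCAAAA", "+", "I" * 19,   # same tag and insert
        "@r3", "TTTTCCCCAAAA", "+", "I" * 12,          # no 5' header
    ]
    kept, stats = clip_and_dedup(parse_fastq(demo))
    print(dict(stats))
```

Under these assumptions, a read with a perfectly normal title line (like `@r3`) is still counted as having "no header", because the check looks only at the start of the sequence line.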

laurahspencer commented 4 years ago

Ah, good to know! The QuantSeq manual/FAQ doesn't indicate whether or not deduplication is necessary (below is a screenshot of their recommended trimming), but my data is single-read without UMIs, and from a couple of things I've read online, deduplication isn't recommended (or possible?) for this type of data. Let me know if you think otherwise!

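Recommended trimming according to QuantSeq's FAQ (https://www.lexogen.com/quantseq-3mrna-sequencing/#quantseqfaq):

[image: https://user-images.githubusercontent.com/17264765/80649399-c82cf000-8a26-11ea-967f-bd91d21584d0.png]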

z0on commented 4 years ago

Hi Laura - my position is that deduplication is always needed, because otherwise your count-based stats (like DESeq2) are not valid; it also removes noise due to over-dispersion of amplified counts. That said, if you don't have the means to deduplicate, you have no choice. Fortunately, it is still OK to publish results based on non-deduped data!

So why do you want to use the tagseq pipeline, if I may ask? There is really nothing special to it, except maybe deduplication :) What reference are you going to map to?

cheers Misha

laurahspencer commented 4 years ago

Hi Misha - I used your pipeline back in fall 2018 on some pilot QuantSeq data, at the suggestion of a colleague. It worked well then, but I don't think you had incorporated deduplication yet (?). I will probably depart from your process a bit, now that I more fully understand what your pipeline is intended for. I will align data to the Olympia oyster (Ostrea lurida) genome, which my lab (https://faculty.washington.edu/sr320/) developed.

Regarding deduplication, that's interesting to know, and I'll definitely have to do more reading on the matter. I'm now wondering if there is a tool I can use to identify duplicates based on the read sequences themselves (i.e. identical sequences), despite not having paired data or molecular identifiers... If you know of any, please let me know! Thanks for all your help!

z0on commented 4 years ago

Hi Laura - I see! If you map to a genome, my pipeline is really not too useful. Just use any mapper of your choice and then featureCounts to extract counts. (You might wish to adjust your genome's GFF file to extend gene regions 1-2 kb towards the 3' end; otherwise gene annotations are often missing the non-coding 3' regions where our reads map.)
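The GFF adjustment mentioned above could be sketched roughly as follows. This is an illustrative Python sketch, not part of the tagseq pipeline: the function name, the choice to modify only `gene` features, and the optional `seq_lengths` dictionary are assumptions. It does not check whether an extended gene now overlaps its downstream neighbour.

```python
def extend_gene_3prime(gff_line, extension=1000, seq_lengths=None):
    """Return a GFF line with 'gene' features extended toward the 3' end.

    GFF coordinates are 1-based and inclusive. On the '+' strand the 3' end
    is the 'end' column; on the '-' strand it is the 'start' column.
    `seq_lengths` (hypothetical helper input) maps sequence IDs to lengths so
    that '+' strand extensions can be clamped to the end of the sequence.
    """
    if gff_line.startswith("#") or not gff_line.strip():
        return gff_line  # comments, directives, blank lines pass through
    newline = "\n" if gff_line.endswith("\n") else ""
    fields = gff_line.rstrip("\n").split("\t")
    if len(fields) != 9 or fields[2] != "gene":
        return gff_line  # only gene records are touched in this sketch
    seqid, strand = fields[0], fields[6]
    start, end = int(fields[3]), int(fields[4])
    if strand == "+":
        end += extension
        if seq_lengths and seqid in seq_lengths:
            end = min(end, seq_lengths[seqid])
    elif strand == "-":
        start = max(1, start - extension)
    else:
        return gff_line  # unstranded feature: 3' end is undefined
    fields[3], fields[4] = str(start), str(end)
    return "\t".join(fields) + newline

if __name__ == "__main__":
    example = [
        "##gff-version 3\n",
        "chr1\tsrc\tgene\t1000\t2000\t.\t+\t.\tID=g1\n",
        "chr1\tsrc\tgene\t5000\t6000\t.\t-\t.\tID=g2\n",
    ]
    for line in example:
        print(extend_gene_3prime(line, extension=1500), end="")
```

Note that featureCounts counts over `exon` features by default, so depending on how counting is set up, the same adjustment may need to be applied to the last exon of each gene rather than to the `gene` record.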

Yes, you can mark duplicates based on the reads alone, using Picard. Still, since in QuantSeq your reads will be piled up in a relatively narrow region near the 3' end, there is a danger of over-deduping (i.e. some reads might legitimately map to the same place because there is not much choice where they could map). Check how your read pile-ups look in IGV.

(Both IGV and Picard are tools from the Broad Institute.)
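The over-deduping risk can be illustrated with a toy simulation. The Python sketch below is a simplification and not Picard's algorithm: it treats reads as duplicates whenever they share chromosome, strand, and 5' position, and it assumes reads start uniformly within a 300-bp window near the 3' end. The window size, coordinates, and function names are made up for illustration. Every simulated read comes from an independent molecule, so any read that gets removed is a false positive.

```python
import random

def mark_duplicates_by_position(reads):
    """Keep the first read seen for each (chrom, strand, five_prime_pos)."""
    seen = set()
    kept = []
    for read in reads:
        key = (read[0], read[1], read[2])
        if key not in seen:
            seen.add(key)
            kept.append(read)
    return kept

def simulate_gene(n_molecules, window=300, chrom="chr1", strand="+", seed=0):
    """Simulate reads from independent molecules whose 5' positions fall
    uniformly within a narrow window (assumed width) near the 3' end."""
    rng = random.Random(seed)
    return [(chrom, strand, 10_000 + rng.randrange(window))
            for _ in range(n_molecules)]

if __name__ == "__main__":
    for n in (10, 100, 1000, 10000):
        kept = mark_duplicates_by_position(simulate_gene(n))
        print(f"{n:>6} true molecules -> {len(kept):>4} reads after dedup")
```

In this toy setup, the number of surviving reads can never exceed the window width, so highly expressed genes are compressed far more than weakly expressed ones. That is the pattern to look for when inspecting pile-ups before and after deduplication.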

cheers Misha
