wyang17 / SQuIRE

Software for Quantifying Interspersed Repeat Expression

does the gene ID collision in GTF affect the stranded libraries? #22

Open zhjilin opened 5 years ago

zhjilin commented 5 years ago

Hi, thanks for developing this cool tool! I'm trying to use mm10 (obtained using Fetch) to quantify some RNA-seq libraries, and I noticed a strange warning during the Count step:

Warning: gene "Gm20747" (on chrY) has reference transcripts on both strands?

I figured out that this is caused by a gene ID collision in mm10_refGene.gtf:

chrY    squire_fetch/mm10_refGene.genepred      transcript      21164554        21166898        .       +       .       gene_id "Gm20747"; transcript_id "NM_001025241"; gene_name "Gm20747";
chrY    squire_fetch/mm10_refGene.genepred      transcript      53415957        53418307        .       -       .       gene_id "Gm20747"; transcript_id "NM_001025241_2"; gene_name "Gm20747";
chrY    squire_fetch/mm10_refGene.genepred      transcript      73313695        73316051        .       -       .       gene_id "Gm20747"; transcript_id "NM_001025241_3"; gene_name "Gm20747";
chrY    squire_fetch/mm10_refGene.genepred      transcript      81799150        81801497        .       -       .       gene_id "Gm20747"; transcript_id "NM_001025241_4"; gene_name "Gm20747";

Since there are quite a few such cases, I'm wondering how this affects stranded libraries. Is it possible to get a quick answer here before I go and check the code?

Another question: do you have descriptions of the input data preparation for each step somewhere (such as the Clean procedure; too lazy to read the code, sorry :) )? Then I could quickly write some scripts for an Ensembl data converter.

Thanks!

wyang17 commented 5 years ago

Thanks!

1) This only affects quantification of genes by StringTie (manual here: https://ccb.jhu.edu/software/stringtie/index.shtml?t=manual). There are multiple outputs, one of which (the GTF file, I believe) quantifies by transcript and shouldn't be affected. The "abundance" file may lump all counts together for that gene ID.

None of this will affect TE quantification.
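To see which gene IDs are affected, the GTF can be scanned for IDs whose transcripts sit on both strands of the same chromosome. The following is a minimal sketch, not part of SQuIRE or StringTie; the function name `genes_on_both_strands` is made up for illustration, and it assumes a tab-separated GTF with `transcript` records and a `gene_id "..."` attribute, as in the lines quoted above.

```python
import re
from collections import defaultdict

# Matches the gene_id attribute in the 9th GTF column.
GENE_ID_RE = re.compile(r'gene_id "([^"]+)"')


def genes_on_both_strands(gtf_lines):
    """Return {(gene_id, chrom): sorted strands} for IDs seen on more than one strand."""
    strands = defaultdict(set)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue  # only look at transcript records
        match = GENE_ID_RE.search(fields[8])
        if match:
            strands[(match.group(1), fields[0])].add(fields[6])
    return {key: sorted(s) for key, s in strands.items() if len(s) > 1}


# Two of the records from the report above, rebuilt with tab separators.
source = "squire_fetch/mm10_refGene.genepred"
example = [
    "\t".join(["chrY", source, "transcript", "21164554", "21166898", ".", "+", ".",
               'gene_id "Gm20747"; transcript_id "NM_001025241"; gene_name "Gm20747";']),
    "\t".join(["chrY", source, "transcript", "53415957", "53418307", ".", "-", ".",
               'gene_id "Gm20747"; transcript_id "NM_001025241_2"; gene_name "Gm20747";']),
]
collisions = genes_on_both_strands(example)
print(collisions)
```

To run it on the real annotation, pass an open file handle for the GTF instead of the `example` list.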

2) I'm not sure what you mean by descriptions of the input data preparation, sorry. The GitHub site and manuscript may have what you want. For Clean, it's mainly converting the RepeatMasker output into a BED file. If you explain what you need with the Ensembl data, I may be able to give you a more direct answer.
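As a rough illustration of that kind of conversion (not SQuIRE's actual Clean code, and not necessarily its output columns), here is a sketch that turns rows in the standard whitespace-separated RepeatMasker `.out` layout into BED6 tuples. The function name `rmsk_out_to_bed` and the example row are made up, and the file that Fetch downloads may use a different column layout.

```python
def rmsk_out_to_bed(lines):
    """Yield (chrom, start, end, name, score, strand) tuples from RepeatMasker .out rows."""
    for line in lines:
        fields = line.split()
        # Header and blank lines are skipped: data rows start with an integer score.
        if len(fields) < 11 or not fields[0].isdigit():
            continue
        chrom = fields[4]
        start = int(fields[5]) - 1  # .out is 1-based; BED starts are 0-based
        end = int(fields[6])        # BED ends are exclusive, so the value is unchanged
        strand = "-" if fields[8] == "C" else "+"  # "C" marks the complement strand
        name = f"{fields[9]}:{fields[10]}"         # repeat name plus class/family
        yield (chrom, start, end, name, fields[0], strand)


# Made-up row in the .out layout, for illustration only.
example = [
    "   SW   perc perc perc  query     position in query    matching  repeat",
    "score   div. del. ins.  sequence  begin end   (left)   repeat    class/family",
    "",
    " 1504   1.3  0.4  1.3  chr1  3000001 3000097 (192471874) C  L1_Mus3  LINE/L1  (3055) 3592 3487  1",
]
bed_rows = list(rmsk_out_to_bed(example))
for row in bed_rows:
    print("\t".join(str(value) for value in row))
```

The coordinate shift is the part most easily missed: only the start changes, because BED intervals are half-open.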

Hope this helps,

Wan Rou


zhjilin commented 5 years ago

Hi, thanks for the quick reply. My apologies for the vague description in my second question. I want to use the Ensembl gene annotation instead. I assume I can use the Clean script directly if I prepare the Ensembl data in the same format as the files obtained through Fetch (I haven't tried yet). Otherwise, some details of the processing in the Clean step would be a great help, so I don't have to go through the code.
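One difference I already expect is chromosome naming: the Fetch files use UCSC names such as `chrY`, while Ensembl uses `Y` and `MT`. A minimal sketch of the renaming I have in mind follows; the helper names are my own, only primary chromosomes are handled, and scaffolds are dropped because they do not map by a simple prefix. Whether this alone makes the file match the Fetch format is an assumption I still need to check.

```python
def ensembl_to_ucsc_chrom(name):
    """Map an Ensembl primary chromosome name to its UCSC-style name, or None."""
    if name == "MT":
        return "chrM"  # mitochondrial genome is named differently
    if name.isdigit() or name in ("X", "Y"):
        return "chr" + name
    return None  # scaffolds and patches need a lookup table, not a prefix


def rename_gtf_line(line):
    """Rewrite the first GTF column; return None for records that should be dropped."""
    if line.startswith("#"):
        return line  # keep header comments unchanged
    chrom, sep, rest = line.partition("\t")
    new_chrom = ensembl_to_ucsc_chrom(chrom)
    return None if new_chrom is None else new_chrom + sep + rest


print(rename_gtf_line("Y\thavana\tgene\t1\t2\n"), end="")
```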

Thanks!
