TagSeq pipeline trouble - Generating read-counts-per-gene

LibraLizard66 commented 5 years ago

Hi all,

I'm trying to assay differential gene expression from 22 RNA samples, two of them sequenced with RNAseq and 20 of them sequenced with TagSeq. I used the RNAseq sequences to generate a reference transcriptome de novo via Trinity to compare the TagSeq sequences to.

I've been following the pipeline listed here: https://github.com/z0on/tag-based_RNAseq/blob/master/tagSeq_processing_README.txt Everything went smoothly until the "# generating read-counts-per gene" section. Here's a rundown of what I did and where things appeared wrong (Skip to Step 4 if you want to go straight to that).

STEP 1 I used the command: grep ">" Trinity.fasta | perl -pe 's/^>(\S+)|(\S+)(i\d+)\s.+/$1|$2$3\t$1$2/' >transcriptome_seq2iso.tab in order to create a two-column tab-delimited table to give correspondence between entries in the transcriptome fasta file and genes (our transcriptome fasta file was named "Trinity.fasta", not "trinity.fasta" as per the readme file). The file "transcriptome_seq2iso.tab" was created.

STEP 2 The pipeline afterwards lists some steps to take if you don't have any assembler-derived genes (cd-hit-est). As we had plenty of assembler-derived-genes, I planned on skipping it, but isogroup_namer.pl (in the next step) seems to depend on having ch-hit-est. Thus, I ran cd-hit-est using the following code: cdhit-est -i Trinity.fasta -o transcriptome_clust.fasta -c 0.99 -G 0 -aL 0.3 -aS 0.3 The following two files were created 1) transcriptome_clust.fasta and 2) transcriptome_clust.fasta.clustr

STEP 3 To try to add cluster designations to the fasta headers, I used the code: ~/tag-based_RNAseq/isogroup_namer.pl transcriptome_clust.fasta transcriptome_clust.fasta.clstr >transcriptome_seq2gene.tab Note: The "~/tag-based_RNAseq" portion of the code was to tell Ubuntu where isogroup_namer.pl was. The following three files were created: 1) transcriptome_clust_iso.fasta 2) transcriptome_clust_iso.seq2iso.tab 3) transcriptome_seq2gene.tab However, when I examined these files with nano, all of the files were empty.

STEP 4 (THIS IS WHERE THINGS HAVE APPARENTLY GONE WRONG) To try to count the hits per isogroup, I ran the following code: /home/genomics/tag-based_RNAseq/samcount_launch_bt2.pl '.sam' ~/Data/devin/JA19193/transcriptome_seq2iso.tab > sc Note: "/home/genomics/tag-based_RNAseq" pointed Ubuntu to samcount_launch_bt2.pl, and "~/Data/devin?JA19193" pointed Ubuntu towards transcriptome_seq2.iso.tab. After doing this, I used nano to edit sc to add the full path to samcount before each sample so Ubuntu could find samcount. For example: samcount.pl 10_merged.trim.sam /home/genomics/Data/devin/JA19193/transcriptome_seq2iso.tab aligner=bowtie2 >10_merged.trim.sam.counts became ~/tag-based_RNAseq/samcount.pl 10_merged.trim.sam /home/genomics/Data/devin/JA19193/transcriptome_seq2iso.tab aligner=bowtie2 >10_merged.trim.sam.counts Note: "10_merged" is the name of a TagSeq sample.

I then executed sc using the following code: bash sc

samcount ran, but in every line of output while it ran, the message, "[gene name] has no isogroup designation” appeared. I tried creating the sc file using the file "transcriptome_seq2gene.tab" (the 3rd file in Step 3 above) and running samcount with this modified sc, but received the same result. Thus, it appears that samcount is treating the only meaningful product file from isogroup_namer as an empty file. Can anyone give me some help? I'm running Ubuntu version 16.04.

Thanks, -Devin

z0on commented 5 years ago

Hi Devin - With Trinity you don’t need steps 2 and 3, they are only for cases when you don’t have assembler-derived genes. Was the file “transcriptome_seq2iso.tab” non-empty? looking reasonable? Can you paste the first 10 lines? Mish

On Sep 10, 2019, at 4:20 PM, LibraLizard66 notifications@github.com wrote:

Hi all,

I'm trying to assay differential gene expression from 22 RNA samples, two of them sequenced with RNAseq and 20 of them sequenced with TagSeq. I used the RNAseq sequences to generate a reference transcriptome de novo via Trinity to compare the TagSeq sequences to.

I've been following the pipeline listed here: https://github.com/z0on/tag-based_RNAseq/blob/master/tagSeq_processing_README.txt https://github.com/z0on/tag-based_RNAseq/blob/master/tagSeq_processing_README.txt Everything went smoothly until the "# generating read-counts-per gene" section. Here's a rundown of what I did and where things appeared wrong (Skip to Step 4 if you want to go straight to that).

STEP 1 I used the command: grep ">" Trinity.fasta | perl -pe 's/^>(\S+)|(\S+)(i\d+)\s.+/$1|$2$3\t$1$2/' >transcriptome_seq2iso.tab in order to create a two-column tab-delimited table to give correspondence between entries in the transcriptome fasta file and genes (our transcriptome fasta file was named "Trinity.fasta", not "trinity.fasta" as per the readme file). The file "transcriptome_seq2iso.tab" was created.

STEP 2 The pipeline afterwards lists some steps to take if you don't have any assembler-derived genes (cd-hit-est). As we had plenty of assembler-derived-genes, I planned on skipping it, but isogroup_namer.pl (in the next step) seems to depend on having ch-hit-est. Thus, I ran cd-hit-est using the following code: cdhit-est -i Trinity.fasta -o transcriptome_clust.fasta -c 0.99 -G 0 -aL 0.3 -aS 0.3 The following two files were created 1) transcriptome_clust.fasta and 2) transcriptome_clust.fasta.clustr

STEP 3 To try to add cluster designations to the fasta headers, I used the code: /tag-based_RNAseq/isogroup_namer.pl transcriptome_clust.fasta transcriptome_clust.fasta.clstr >transcriptome_seq2gene.tab Note: The "/tag-based_RNAseq" portion of the code was to tell Ubuntu where isogroup_namer.pl was. The following four files were created: 1) transcriptome_clust_iso.fasta 2) transcriptome_clust_iso.seq2iso.tab 3) transcriptome_seq2gene.tab and 4) transcriptome_seq2iso.tab However, when I examined these files with nano, all of the files were empty except for transcriptome_seq2iso.tab (the 4th one), which, as far as I could tell, looked like it should.

STEP 4 (THIS IS WHERE THINGS HAVE APPARENTLY GONE WRONG) To try to count the hits per isogroup, I ran the following code: /home/genomics/tag-based_RNAseq/samcount_launch_bt2.pl '.sam' /Data/devin/JA19193/transcriptome_seq2iso.tab > sc Note: "/home/genomics/tag-based_RNAseq" pointed Ubuntu to samcount_launch_bt2.pl, and "/Data/devin?JA19193" pointed Ubuntu towards transcriptome_seq2.iso.tab. After doing this, I used nano to edit sc to add the full path to samcount before each sample so Ubuntu could find samcount. For example: samcount.pl 10_merged.trim.sam /home/genomics/Data/devin/JA19193/transcriptome_seq2iso.tab aligner=bowtie2 >10_merged.trim.sam.counts became ~/tag-based_RNAseq/samcount.pl 10_merged.trim.sam /home/genomics/Data/devin/JA19193/transcriptome_seq2iso.tab aligner=bowtie2 >10_merged.trim.sam.counts Note: "10_merged" is the name of a TagSeq sample.

I then executed sc using the following code: bash sc

samcount ran, but in every line of output while it ran, the message, "[gene name] has no isogroup designation” appeared. I tried creating the sc file using the file "transcriptome_seq2gene.tab" (the 3rd file in Step 3 above) and running samcount with this modified sc, but received the same result. Thus, it appears that samcount is treating the only meaningful product file from isogroup_namer as an empty file. Can anyone give me some help? I'm running Ubuntu version 16.04.

Thanks, -Devin

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/z0on/tag-based_RNAseq/issues/1?email_source=notifications&email_token=ABZUHGDQGMRQDNWO4ECE65LQJAFRHA5CNFSM4IVMWOB2YY3PNVWWK3TUL52HS4DFUVEXG43VMWVGG33NNVSW45C7NFSM4HKR3WTQ, or mute the thread https://github.com/notifications/unsubscribe-auth/ABZUHGCRTPOSD66W6DS677LQJAFRHANCNFSM4IVMWOBQ.

LibraLizard66 commented 5 years ago

Sure thing. Here are the first 10 lines from transcriptome_seq2iso.tab that I made in Step 1:

TRINITY_DN4600_c0_g1_i4 len=1014 path=[0:0-45 2:46-268 3:269-349 5:350-1013] TRINITY_DN4600_c0_g1_i1 len=933 path=[0:0-45 2:46-268 5:269-932] TRINITY_DN4600_c0_g1_i5 len=1208 path=[1:0-911 2:912-1134 4:1135-1207] TRINITY_DN4601_c0_g1_i1 len=2125 path=[0:0-190 1:191-275 2:276-2124] TRINITY_DN4601_c0_g1_i2 len=2040 path=[0:0-190 2:191-2039] TRINITY_DN4602_c0_g1_i3 len=2322 path=[0:0-66 3:67-169 5:170-2321] TRINITY_DN4602_c0_g1_i4 len=2487 path=[0:0-66 1:67-133 2:134-231 3:232-334 5:335-2486] TRINITY_DN4602_c0_g1_i1 len=2338 path=[0:0-66 3:67-169 4:170-185 5:186-2337] TRINITY_DN4602_c0_g1_i2 len=2389 path=[0:0-66 1:67-133 3:134-236 5:237-2388] TRINITY_DN4602_c0_g2_i1 len=405 path=[0:0-404]

-Devin

z0on commented 5 years ago

This looks wrong, it should be a two-column table. Try this instead of Step 1:

grep ">" transcriptome.fasta | perl -pe 's/>((TRINITY.+_g\d+)\S+)/$1\t$2/' >transcriptome_seq2iso.tab

Then move to samcount.pl step.

LibraLizard66 commented 5 years ago

That got rid of the 'no isogroup designation' error while running. However, when running sc, I got many notices saying either "disregarding reads mapping to multiple isogroups" or "cannot find alignment score in" (sic). Do you think this is ok?

For reference, I've provided the first ten lines of allcounts.txt that resulted from assembling all of the counts into a single table using: /home/genomics/tag-based_RNAseq/expression_compiler.pl *.sam.counts > allcounts.txt

    10_merged.trim.sam.counts       11C_merged.trim.sam.counts      12C_merged.trim.sam.counts      13_merged.trim.sam.counts       14_merged.trim.sam.counts       15C_merged.trim.sam.counts      15_merged.trim.sam.counts       17_merged.trim.sam.counts       19_merged.trim.sam.counts       27_merged.trim.sam.counts       29_merged.trim.sam.counts       2_merged.trim.sam.counts        37_merged.trim.sam.counts       3C_merged.trim.sam.counts       4C_merged.tr$

TRINITY_DN0_c0_g1 1 0 3 1 1 3 1 0 1 2 3 1 0 2 6 3 1 2 2 0 TRINITY_DN0_c1_g1 130 90 78 81 58 184 110 78 157 69 69 91 108 64 97 125 131 96 213 154 TRINITY_DN10000_c0_g1 0 1 5 0 1 0 0 2 0 4 3 1 0 0 6 2 0 2 0 0 TRINITY_DN10000_c1_g1 0 0 0 0 1 0 0 0 0 2 0 1 0 0 0 0 0 0 0 0 TRINITY_DN10000_c1_g2 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 TRINITY_DN10000_c1_g3 0 0 0 0 0 0 0 0 2 0 1 1 0 0 0 0 0 0 0 0 TRINITY_DN10001_c0_g1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 TRINITY_DN10002_c0_g1 5 5 9 3 4 3 3 6 5 6 7 9 2 4 3 3 3 3 6 3

z0on commented 5 years ago

This looks correct!

On Sep 15, 2019, at 5:23 PM, LibraLizard66 notifications@github.com wrote:

That got rid of the 'no isogroup designation' error while running. However, when running sc, I got many notices saying either "disregarding reads mapping to multiple isogroups" or "cannot find alignment score in" (sic). Do you think this is ok?

For reference, I've provided the first ten lines of allcounts.txt that resulted from assembling all of the counts into a single table using: /home/genomics/tag-based_RNAseq/expression_compiler.pl *.sam.counts > allcounts.txt
10_merged.trim.sam.counts       11C_merged.trim.sam.counts      12C_merged.trim.sam.counts      13_merged.trim.sam.counts       14_merged.trim.sam.counts       15C_merged.trim.sam.counts      15_merged.trim.sam.counts       17_merged.trim.sam.counts       19_merged.trim.sam.counts       27_merged.trim.sam.counts       29_merged.trim.sam.counts       2_merged.trim.sam.counts        37_merged.trim.sam.counts       3C_merged.trim.sam.counts       4C_merged.tr$
TRINITY_DN0_c0_g1 1 0 3 1 1 3 1 0 1 2 3 1 0 2 6 3 1 2 2 0 TRINITY_DN0_c1_g1 130 90 78 81 58 184 110 78 157 69 69 91 108 64 97 125 131 96 213 154 TRINITY_DN10000_c0_g1 0 1 5 0 1 0 0 2 0 4 3 1 0 0 6 2 0 2 0 0 TRINITY_DN10000_c1_g1 0 0 0 0 1 0 0 0 0 2 0 1 0 0 0 0 0 0 0 0 TRINITY_DN10000_c1_g2 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 TRINITY_DN10000_c1_g3 0 0 0 0 0 0 0 0 2 0 1 1 0 0 0 0 0 0 0 0 TRINITY_DN10001_c0_g1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 TRINITY_DN10002_c0_g1 5 5 9 3 4 3 3 6 5 6 7 9 2 4 3 3 3 3 6 3

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/z0on/tag-based_RNAseq/issues/1?email_source=notifications&email_token=ABZUHGFNFACSWXYEQX42UCTQJ2YXBA5CNFSM4IVMWOB2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD6X2IDI#issuecomment-531604493, or mute the thread https://github.com/notifications/unsubscribe-auth/ABZUHGG4DQWLFLLI6BXYSIDQJ2YXBANCNFSM4IVMWOBQ.

LibraLizard66 commented 5 years ago

Thanks very much for your help!

paigeduffin commented 2 years ago

Can you please describe what the code below does? grep ">" transcriptome.fasta | perl -pe 's/>((TRINITY.+_g\d+)\S+)/$1\t$2/' >transcriptome_seq2iso.tab

I am having an issue with the same step, though I'm not using Trinity. I left a different issue detailing my problem about a month ago but I might be able to figure this out on my own if I knew what this meant so I could alter it to fit my own situation or if I could get an example of what a correct _seq2iso.tab file looks like.

z0on commented 2 years ago

Ah, that’s to create a two-column table where the first column is the contig name and the second column is the gene it belongs to. The code will probably not work with newer trinity versions, but all you need to change in it to match the current version is the regex pattern

((TRINITY.+_g\d+)\S+)

to extract the right string for gene (“component” in old trinity speak) name.

Can you sendme the header of one of your contigs?

On Wed, Aug 17, 2022 at 6:50 PM paigeduffin @.***> wrote:

Can you please describe what the code below does? grep ">" transcriptome.fasta | perl -pe 's/>((TRINITY.+_g\d+)\S+)/$1\t$2/'

transcriptome_seq2iso.tab

I am having an issue with the same step, though I'm not using Trinity. I left a different issue detailing my problem about a month ago but I might be able to figure this out on my own if I knew what this meant so I could alter it to fit my own situation or if I could get an example of what a correct _seq2iso.tab file looks like.

— Reply to this email directly, view it on GitHub https://github.com/z0on/tag-based_RNAseq/issues/1#issuecomment-1218568195, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABZUHGFNKU5U47NZCVQ7ZQTVZVUDLANCNFSM4IVMWOBQ . You are receiving this because you commented.Message ID: @.***>

-- cheers Misha matzlab.weebly.com

z0on / tag-based_RNAseq

TagSeq pipeline trouble - Generating read-counts-per-gene #1