Open paigeduffin opened 2 years ago
Sending a friendly reminder that I am still struggling with this!
Hi Paige - I’m sorry I missed your initial email! I’m at a conference/workshop this week, will respond ASAP
On Wed, Aug 17, 2022 at 6:51 PM paigeduffin @.***> wrote:
Sending a friendly reminder that I am still struggling with this!
— Reply to this email directly, view it on GitHub https://github.com/z0on/tag-based_RNAseq/issues/4#issuecomment-1218569852, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABZUHGCCHYD5HVCXGKCZDWDVZVUIXANCNFSM53UHZA6A . You are receiving this because you are subscribed to this thread.Message ID: @.***>
-- cheers Misha matzlab.weebly.com
That's okay- thank you for the reply! I'm grateful you may be able to help me out with this! I'll update below a few additional things I've tried since my original post:
I found another Github user ckenkel (relevant repo here) who had some notes in a pdf they uploaded to a forked + modified version of your repo (see Appendix I on the last page of this file in their repo) on creating the *_seq2iso.tab
file that my primary hang-up seems to revolve around.
I modified the sample code to fit my needs (cluster details / general environment still as described above). The first executable I ran was:
$ ml CD-HIT/4.8.1-GCC-10.2.0
$ cd-hit-est -i pisaster_transcriptome.fa -o pisaster_transcriptome_clust.fasta -c 0.99 -G 0 -aL 0.3 -aS 0.3
which successfully generated the intended output pisaster_transcriptome_clust.fasta
(along with another file, titled pisaster_transcriptome_clust.fasta.clstr
). I was encouraged by this! Next I ran:
$ ml Perl/5.32.1-GCCcore-10.3.0
$ perl ./isogroup_namer.pl pisaster_transcriptome.fa pisaster_transcriptome_clust.fasta > pisaster_transcriptome_seq2iso.tab
This generated three output files:
$ pisaster_transcriptome_seq2iso.tab
$ pisaster_transcriptome_iso.fa
$ pisaster_transcriptome_iso_seq2iso.tab
While the other two files were "populated" with information, the one file I expected and needed as an output, pisaster_transcriptome_seq2iso.tab
, was empty. Darn!
Still, I tried running the subsequent step from your code with the pisaster_transcriptome_iso_seq2iso.tab
file instead:
$ perl ./samcount_launch_bt2.pl '\.sam' pisaster_transcriptome_iso_seq2iso.tab > sc
$ screen
$ ml Bowtie2/2.4.5-GCC-10.2.0
$ ml Perl/5.32.1-GCCcore-10.3.0
$ . sc
Sadly, this did not work. The *.sam.counts
files were empty :( another dead end.
I've just forked this repository and uploaded/committed the first part (ie the "head") of some files I've referenced in this issue. Here is that commit: https://github.com/paigeduffin/tag-based_RNAseq/commit/64d0e5479fa66c5f94d3a73bcbdb5ec749233951
After pasting the head()
of these files into a txt document, I named them based on the file they represent, as follows: head_[original_file_name].txt
I hope this all makes sense and that it is helpful in diagnosing what is going on or what I am doing incorrectly... any guidance or help would be so greatly appreciated! Thank you!
Hi Paige - is your transcriptome made by Trinity by any chance?
On Wed, Aug 17, 2022 at 10:56 PM paigeduffin @.***> wrote:
I've just forked this repository and uploaded/committed the first part (ie the "head") of some files I've referenced in this issue. Here is that commit: @.*** https://github.com/paigeduffin/tag-based_RNAseq/commit/64d0e5479fa66c5f94d3a73bcbdb5ec749233951
After pasting the head() of these files into a txt document, I named them based on the file they represent, as follows: head_[original_file_name].txt
I hope this all makes sense and that it is helpful in diagnosing what is going on or what I am doing incorrectly... any guidance or help would be so greatly appreciated! Thank you!
— Reply to this email directly, view it on GitHub https://github.com/z0on/tag-based_RNAseq/issues/4#issuecomment-1218973321, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABZUHGEJJPJ7X7ZT3DDSPD3VZWQ6LANCNFSM53UHZA6A . You are receiving this because you commented.Message ID: @.***>
-- cheers Misha matzlab.weebly.com
Ok, it all comes back slowly to me now… :)
In the simplest case, your seq2iso file can be just two identical columns of contig names - which means each contig is treated as it’s own gene. This is a reasonable choice if your transcriptome is derived from a genome (but make sure to add an extra 500b to 3’end of genes if you are not sure if your gene annotations actually include 3’ UTRs)
If you have a Trinity denovo transcriptome, it makes sense to put “gene” identifier in the second column. Trinity contigs are names like TRINITY_DN1000_c115_g5_i1 You want to extract “g5” out of this string and put it into second column, that’s what my bash command is supposed to do.
If you have some other denovo transcriptome that might have several isoforms per gene, run isogroup_namer.pl without any arguments to see instructions what to do. You did it almost right but not quite - let me know how it goes.
Misha
On Thu, Aug 18, 2022 at 7:10 AM Mikhail V Matz @.***> wrote:
Hi Paige - is your transcriptome made by Trinity by any chance?
On Wed, Aug 17, 2022 at 10:56 PM paigeduffin @.***> wrote:
I've just forked this repository and uploaded/committed the first part (ie the "head") of some files I've referenced in this issue. Here is that commit: @.*** https://github.com/paigeduffin/tag-based_RNAseq/commit/64d0e5479fa66c5f94d3a73bcbdb5ec749233951
After pasting the head() of these files into a txt document, I named them based on the file they represent, as follows: head_[original_file_name].txt
I hope this all makes sense and that it is helpful in diagnosing what is going on or what I am doing incorrectly... any guidance or help would be so greatly appreciated! Thank you!
— Reply to this email directly, view it on GitHub https://github.com/z0on/tag-based_RNAseq/issues/4#issuecomment-1218973321, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABZUHGEJJPJ7X7ZT3DDSPD3VZWQ6LANCNFSM53UHZA6A . You are receiving this because you commented.Message ID: @.***>
-- cheers Misha matzlab.weebly.com
-- cheers Misha matzlab.weebly.com
Thank you so much for the reply! I haven't had a chance to try your suggestion yet, but I will be working on it tomorrow!
Good evening all,
I want to apologize in advance for the extremely long thread, but I am so new and uncertain with all of this that I wanted to make sure everything I've done is crystal clear in case I've made any errors along the way. My project seeks to understand differential gene expression in diseased Pisaster ochraceus sea stars. I am attempting to get a pipeline up and running for 90+ tagseq samples with a smaller subset of 3 samples. As I'm not sure where I've gone wrong, I am going to describe every step from the raw reads to where I've hit a roadblock that has completely stopped me in my tracks.
For reference, I am working on the Sapelo2 Linux cluster at the University of Georgia and mainly following along with your tutorial tagSeq_processing_README.txt.
Step 1. Prepare workspace and concatenate raw reads
Step 1A. Copy raw reads into working directory. Raw read pairs have been renamed to reflect the following format:
sample_*_L1.fastq.gz
andsample_*_L2.fastq.gz
. Step 1B. Unzip raw read files.$ gunzip sample_*.fastq.gz
Step 1C. Make surengs_concat.pl
script is in working directory. Step 1D. Load perl module, run script to concatenate paired reads$ ml Perl/5.32.1-GCCcore-10.3.0
$ perl ngs_concat.pl 'sample' 'sample_(.+)_L'
Step 2. Adaptor trimming
Note: At the time I did not understand "screen" + piping to the file named
clean
, so I wrote each adaptor trimming line manually. (I figured out how to do this for themaps
step, so I will do it the "right way" when I scale up to the whole genome.Step 2A. Make sure
tagseq_clipper.pl
script is in working directory.Step 2B. Run code to trim reads. The three samples I'm working with are named
10.8.fq
,1.15.fq
, and11.4.fq
at this stage.$ ml cutadapt/3.4-GCCcore-8.3.0-Python-3.7.4
$ perl tagseq_clipper.pl 10.8.fq | cutadapt - -a AAAAAAAA -a AGATCGG -q 15 -m 25 -o 10.8.trim
$ perl tagseq_clipper.pl 1.15.fq | cutadapt - -a AAAAAAAA -a AGATCGG -q 15 -m 25 -o 1.15.trim
$ perl tagseq_clipper.pl 11.4.fq | cutadapt - -a AAAAAAAA -a AGATCGG -q 15 -m 25 -o 11.4.trim
Step 3. Create transcriptome from reference genome.
I am fortunate that I'm working with a species for which there is a published reference genome. After a lot of research on the internet, this is what I came up with to create my in silico transcriptome. I do not know for sure whether or not I've done this correctly- as you'll see, I ran into a few snags.
Step 3A. Download reference genome into working directory.
$ wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/010/994/315/GCA_010994315.2_ASM1099431v2/GCA_010994315.2_ASM1099431v2_genomic.fna.gz
$ gunzip GCA_010994315.2_ASM1099431v2_genomic.fna.gz
$ mv GCA_010994315.2_ASM1099431v2_genomic.fna.gz pisaster_ref_genome.fa
# rename fileStep 3B. Get annotation file into working directory. Download annotation file from colleague's Dryad: https://datadryad.org/stash/dataset/doi:10.6071/M3ND50 Upload annotation file to Sapelo2 in working directory, rename
genome_annotation.gff3
.Step 3C. Troubleshoot and troubleshoot some more. I was getting the same error when I tried many different variations of code that used
gffread
to create transcriptome from the genome and annotation files. Got involved in a Github issue in the repository forgffread
that matched the error I was getting… and it led to me getting my answer: there are spaces in my ref genome sequence names and there should not be.An example of what the first and last sequence names in my file look like:
Spent an extraordinarily long time trying to figure out some code that could efficiently fix this for me and failed. Shamefully, I instead landed on creating a shell script that renamed each contig individually (eg
sed -i 's/CM021526.1/Sc28pcJ_649/g' pisaster_ref_genome_header.mod.fa
for contig 1). But hey, it looks like it worked.Then, I reran my transcriptome building step (
gffread -F -w pisaster_transcriptome.fa -g pisaster_ref_genome_header.mod.fa ann_simple.gff
) and got a new error: "GffObj::getSpliced() error: improper genomic coordinate 17535992 on Sc28pcJ_1844 for g8980.t1". I was frustrated so I just deleted that line, and 5 or 6 other lines that gave me similar issues. After that, it finally worked! As an example, the first two entries in my newpisaster_transcriptome.fa
look like this:Step 4. Back to the tag-based_RNAseq tutorial for mapping: preparing reference transcriptome
Create bowtie2 index for transcriptome.
$ ml Bowtie2/2.4.5-GCC-10.2.0
$ bowtie2-build pisaster_transcriptome.fa pisaster_transcriptome.fa
Here is the text output I was given when I executed the above code: bowtie2_build_output_june.18.2022.txtStep 5. Map reads to transcriptome.
Move relevant perl script to working dir, create
maps
file as in tutorial (this is where I figured out how to usescreen
).$ ml Perl/5.32.1-GCCcore-10.3.0
$ perl tagseq_bowtie2map.pl "trim$" pisaster_transcriptome.fa > maps
$ screen
$ ml Bowtie2/2.4.5-GCC-10.2.0
$ . maps
Here is the text output I was given when I executed the above code: maps_screenOutputFile.txtStep 6. (attempt to) generate read counts per gene.
This is where my big "roadblock" is. At first, I thought I needed to make a true
transcriptome_seq2gene.tab
table, and was really confused by the instructions, because I didn't use Trinity. So I relied on another Github repo that is actually derived from this one to at least generate a populatedseq2iso.tab
file without any error messages. The version of that code that I attempted to modify to fit my needs was:$ grep '^>' pisaster_transcriptome.fa |\ perl -pe 's/>(\S+)\s.*gene:(\S+).*/\1\t\2/g' >\ pisaster_transcriptome_seq2iso.tab
The first two lines of this .tab file are:
As I'm sure you can guess, this did not work. After running the following:
$ perl ./samcount_launch_bt2.pl '\.sam' pisaster_transcriptome_seq2iso.tab > sc
$ screen
$ ml Bowtie2/2.4.5-GCC-10.2.0
$ ml Perl/5.32.1-GCCcore-10.3.0
$ . sc
...I received a massive chain of error messages (egg19414.t1 has no isogroup designation
) and the.sam.counts
files were empty.Finally, today, it dawned on me that this line in the instructions is (I believe) relevant to me:
I followed these instructions very literally and created a dummy .tab file named "dummy_seq2iso.tab" with one line of text "gene [tab] gene". However, when I reran the above code with this new .tab file, I received this error x3 (one for each sample, I presume):
I made sure restrictive file permissions were not the cause of this error and did a lot of sleuthing around on the internet, but I am at a loss on what to do next.
Any insight on this last step specifically, or if you believe the error is in one of these previous steps I've described, would be IMMENSELY appreciated.
Thank you so much in advance.
Paige Duffin