wegnerce / peng_et_al_2018

Details about metatranscriptome data processing in Peng et al., 2018 https://doi.org/10.1186/s40168-018-0546-9
3 stars 0 forks source link

doubts on mRNA analysis using sortmerna and Megan6 #1

Open sumitra20 opened 1 year ago

sumitra20 commented 1 year ago

Hi @wegnerce ,

Im currently working on metatranscriptome of plant samples as well and i've been referring to this amazing work for as a guidance for my bioinformatic analysis. Im very new to all bioinformatic analysis itself and i have several doubts to clarify. For my samples, the rRNAs from both eukaryotes and prokaryotes were depleted from total RNA samples prior to sequencing. After the pre-processing, i performed sortmerna2.0 with my raw data upon merging the paired reads and my output file after removing the remaining rRNAs is quite huge (10-12gb approx). Im not sure if im performing the analysis correctly. I'll attach the command line i used just for your reference. I then use the output file to run diamond blast and import it into Megan6 to get taxonomic & functional annotations.

Also, based the current work the bacterial and archaeal mRNA reads were said to be extracted through MEGAN6 tool after sortmerna. Do you mind explaining how the extraction was done, im struggling to understand how this works?

For mRNA analysis, reads mapping to rRNA and non-coding small RNA were filtered by SortMeRNA 2.0, using SILVA (release 128) and RFAM as reference databases. The remaining reads were considered putative mRNA. We identified only a low proportion of mRNA reads (0.39–3.20%) to be derived from Eukaryota. Like for SSU rRNA, we therefore extracted only bacterial and archaeal mRNA reads through MEGAN6 Ultimate Edition for downstream analysis.

Did really appreciate your response.

sortmerna -ref ./rRNA_databases/silva-bac-16s-id90.fasta -ref ./rRNA_databases/silva-bac-23s-id98.fasta -ref ./rRNA_databases/silva-arc-16s-id95.fasta -ref ./rRNA_databases/silva-arc-23s-id98.fasta -ref ./rRNA_databases/silva-euk-18s-id95.fasta -ref ./rRNA_databases/silva-euk-28s-id98.fasta -ref ./rRNA_databases/rfam-5s-database-id98.fasta -ref ./rRNA_databases/rfam-5.8s-database-id98.fasta --reads ./R33P4_R2_sortme_merge.fq -sam -num_alignments 1 -fastx -aligned ./R33P4_R2_sortme_rRNA -other ./R33P4_R2_sortme_non_rRNA fastq -v -m 32219 -threads 16

wegnerce commented 1 year ago

Hi sumitra20,

I'm travelling right now. I hope the following helps. I'm back in the office on September 20th.

If I get your question right, you are struggling with MEGAN a bit. Extracting reads is super straightforward. If you have your data opened in MEGAN (as rma6 file, or just after import and annotation), you can right-click on any taxonomic node on any taxonomic level and select to extract the corresponding reads into a .fasta file. Please be aware that you can choose to include "Summarized Reads". This allows you to either extract only those reads that were assigned to this exact classification (e.g. Enterobacteriaceae), or also reads with a more resolved (I hope this makes sense) assignment (e.g. Reads assigned to Enterobacteriaceae, but also those assigned to genera within this family, let's say Escherichia).

Best

On Tue, 6 Sept 2022 at 05:29, sumitra20 @.***> wrote:

Hi @wegnerce https://github.com/wegnerce ,

Im currently working on metatranscriptome of plant samples as well and i've been referring to this amazing work for as a guidance for my bioinformatic analysis. Im very new to all bioinformatic analysis itself and i have several doubts to clarify. For my samples, the rRNAs from both eukaryotes and prokaryotes were depleted from total RNA samples prior to sequencing. After the pre-processing, i performed sortmerna2.0 with my raw data upon merging the paired reads and my output file after removing the remaining rRNAs is quite huge (10-12gb approx). Im not sure if im performing the analysis correctly. I'll attach the command line i used just for your reference. I then use the output file to run diamond blast and import it into Megan6 to get taxonomic & functional annotations.

Also, based the current work the bacterial and archaeal mRNA reads were said to be extracted through MEGAN6 tool after sortmerna. Do you mind explaining how the extraction was done, im struggling to understand how this works?

For mRNA analysis, reads mapping to rRNA and non-coding small RNA were filtered by SortMeRNA 2.0, using SILVA (release 128) and RFAM as reference databases. The remaining reads were considered putative mRNA. We identified only a low proportion of mRNA reads (0.39–3.20%) to be derived from Eukaryota. Like for SSU rRNA, we therefore extracted only bacterial and archaeal mRNA reads through MEGAN6 Ultimate Edition for downstream analysis.

Did really appreciate your response.

sortmerna -ref ./rRNA_databases/silva-bac-16s-id90.fasta -ref ./rRNA_databases/silva-bac-23s-id98.fasta -ref ./rRNA_databases/silva-arc-16s-id95.fasta -ref ./rRNA_databases/silva-arc-23s-id98.fasta -ref ./rRNA_databases/silva-euk-18s-id95.fasta -ref ./rRNA_databases/silva-euk-28s-id98.fasta -ref ./rRNA_databases/rfam-5s-database-id98.fasta -ref ./rRNA_databases/rfam-5.8s-database-id98.fasta --reads ./R33P4_R2_sortme_merge.fq -sam -num_alignments 1 -fastx -aligned ./R33P4_R2_sortme_rRNA -other ./R33P4_R2_sortme_non_rRNA fastq -v -m 32219 -threads 16

— Reply to this email directly, view it on GitHub https://github.com/wegnerce/peng_et_al_2018/issues/1, or unsubscribe https://github.com/notifications/unsubscribe-auth/AB425RWFANODDPA3HKKVJI3V423ARANCNFSM6AAAAAAQFNRCXY . You are receiving this because you were mentioned.Message ID: @.***>

sumitra20 commented 1 year ago

Hi @wegnerce ,

Thank you for the response, i hope you're back. Ahh ok, so that's how it works!! I have been extracting out .fasta files using the same approach previously. But got worried when i saw about the extraction of bacterial mRNA in your paper, i wasn't sure if i was doing it right. Since im new to all this i tend to get a bit worried. My next concern is about the CAZyme analysis using my mRNA reads. I've downloaded the latest database and performed a diamond run against the database, but im not sure how to get a final summary (abundance+enzyme family) out.

These are the steps i followed through:

#download CaZyme databse
wget http://bcb.unl.edu/dbCAN2/download/Databases/V11/CAZyDB.08062022.fa

#index it to diamond format
/home/combio7/diamond-2.0.12/bin/diamond makedb --in ./CAZyDB.08062022.fa -d CAZy

#diamond_blast

/home/combio7/diamond-2.0.12/bin/diamond blastx -d CAZy.dmnd -q /media/combio7/8TB_Backup/sumitra_130122/cycle_5_30822/5_sortme_cycle5/R33P5_R_sortme_non_rRNA.fq -o ./R33P5_R_cazyme.tab -b 20.0 -f 6 -e 1e-3

#donwloading mapping file 
wget https://bcb.unl.edu/dbCAN2/download/Databases/V11/CAZyDB.08062022.fam.subfam.ec.txt

#downloaded cazy_data.txt with taxonomic data from http://www.cazy.org/IMG/cazy_data/cazy_data.zip

My diamond output looks something like this: A00917:904:H5GY5DSX3:2:1108:26332:33035 CAA77237.1|GT75|2.4.1.- 93.9 49 3 0 2 148 167 215 2.07e-27 105 A00917:904:H5GY5DSX3:2:1108:26332:33035 AKZ17608.1|GT75 93.9 49 3 0 2 148 167 215 2.15e-27 105 A00917:904:H5GY5DSX3:2:1108:26332:33035 CAH59419.1|GT75 89.8 49 5 0 2 148 18 66 4.96e-27 101 A00917:904:H5GY5DSX3:2:1108:26332:33035 CAD1819425.1|GT75 89.8 49 5 0 2 148 127 175 4.98e-27 101 A00917:904:H5GY5DSX3:2:1108:26332:33035 QAX90478.1|GT75 91.8 49 4 0 2 148 159 207 5.01e-27 103 A00917:904:H5GY5DSX3:2:1108:26332:33035 AFK40119.1|GT75 91.8 49 4 0 2 148 159 207 5.07e-27 103 A00917:904:H5GY5DSX3:2:1108:26332:33035 ABA81861.1|GT75|2.4.1.- 91.8 49 4 0 2 148 161 209 5.27e-27 103 A00917:904:H5GY5DSX3:2:1108:26332:33035 CAG1848511.1|GT75 91.8 49 4 0 2 148 164 212 5.27e-27 103 A00917:904:H5GY5DSX3:2:1108:26332:33035 CAG1843805.1|GT75 89.8 49 5 0 2 148 163 211 5.34e-27 103 A00917:904:H5GY5DSX3:2:1108:26332:33035 CAC84517.1|GT75|2.4.1.- 91.8 49 4 0 2 148 161 209 5.48e-27 103 A00917:904:H5GY5DSX3:2:1108:26332:33035 CAE5990266.1|GT75 98.0 49 1 0 147 1 1 49 8.13e-34 114 A00917:904:H5GY5DSX3:2:1108:26332:33035 ABR26070.1|GT75|2.4.1.- 100 50 0 0 150 1 64 113 1.01e-33 118 A00917:904:H5GY5DSX3:2:1108:26332:33035 AFK48049.1|GT75 96.0 50 2 0 150 1 14 63 7.18e-33 114 A00917:904:H5GY5DSX3:2:1108:26332:33035 AFW90628.1|GT75 96.0 50 2 0 150 1 10 59 9.74e-33 114 A00917:904:H5GY5DSX3:2:1108:26332:33035 CAA09469.1|GT75|2.4.1.-|5.4.99.30 100 50 0 0 150 1 213 262 1.76e-32 118

I tried using python dbCAN_annotator.py but i ended with error python dbCAN_annotator.py --mode annot --diamond ./R33P5_R_cazyme.tab --seq ./CAZyme/R33P5_R_czy_trial.fasta

error

look_up = check_dbCAN(line.split("\t")[1].split("|")[1]) # we look up GO tags for Uniref TypeError: a bytes-like object is required, not 'str'

Do you have any advice/suggestion on how i can proceed? Do you think i should manually map my output against CAZyDB.08062022.fam.subfam.ec.txt and cazy_data.txt? But then again by doing this i wont be able to get the enzyme subfamily name, which i'll also have to search manually.