Closed rzhan186 closed 1 year ago
The failing test concerns the midori database, which is currently being updated, so that failure is expected. The test for precomputed hits passes.
The BLAST hits file you provided in the example contains spaces instead of tabs. Could it simply be that your text editor silently converted all tabs to spaces?
It's fine to leave the comments in. Lines with '#' are ignored by the BioPython parser.
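If the tabs were indeed converted, a small helper along these lines could restore them. This is only a sketch: the function name `spaces_to_tabs` is made up for illustration, and it assumes that no field in the hits file legitimately contains a space (that holds for the standard BLAST tabular columns, but not if a free-text column such as the subject title was requested).

```shell
spaces_to_tabs() {
    # usage: spaces_to_tabs broken.hits > fixed.hits
    # Assumption: spaces only ever appear as (damaged) column separators.
    awk '
        /^#/ { print; next }            # keep comment lines exactly as they are
        { gsub(/ +/, "\t"); print }     # turn each run of spaces into a single tab
    ' "$1"
}
```

Writing to a new file rather than editing in place makes it easy to compare the two versions before handing the fixed one to crest4.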
Hi @xapple ,
Thank you for your prompt reply! CREST worked after I replaced the spaces with tabs! I got something like this:
c_000000000014_hx1b_ssu_aligned_rna No hits
c_000000000032_hx1b_ssu_aligned_rna root; Main genome; Eukaryota; Archaeplastida; Chloroplastida; Embryophyta; eudicotyledons; Brassicales; Cleomaceae; Tarenaya; Tarenaya hassleriana
c_000000000050_hx1b_ssu_aligned_rna root; Main genome; Eukaryota; Archaeplastida; Chloroplastida; Embryophyta
However, I'd also like CREST to create OTU tables. When I run this:
crest4 \
--fasta $fasta/aligned.SSU.subsampled.fa \
--search_hits $SCRATCH/test.hits \
-o $output \
--otu_table $output
I got this error:
^^^^^
IsADirectoryError: [Errno 21] Is a directory: '/scratch/ruizhang/crest4/test/'
I thought I had to create the file manually so that CREST could write to it, but this didn't work either:
crest4 \
--fasta $fasta/aligned.${site}.SSU.subsampled.fa \
--search_hits $SCRATCH/test.hits \
-o $output \
--otu_table $output/table.tsv
Could you also help me troubleshoot this, please?
Hi Rui,
This is failing because CREST cannot create an OTU table from scratch. The --otu_table option expects an input file (not an output file) with the estimated abundance of each OTU (or sequence variant) across different samples or datasets. CREST then uses this to calculate taxonomic abundances across datasets, and also returns the OTU table with an added column giving the taxonomic annotation of each OTU.
I am not sure what you want to do or how you want to generate an OTU table from the sequence input alone. With only a FASTA file as input, CREST cannot determine which sample each sequence read comes from, and there is no function for this.
One thing you could do is to use an assembly tool adapted to rRNA sequences, like MetaRib (https://github.com/yxxue/MetaRib) or EMIRGE (https://github.com/csmiller/EMIRGE). The latter is simpler and easier to install, but would probably only work for input files of <10 million reads or so. Using assembled rRNA sequences would then improve your taxonomic assignments and decrease the size of your dataset. You could also merge rRNA contigs across assemblies (e.g. samples), map your reads to this "SSU rRNA catalog" and create an OTU table with the resulting abundances across datasets, which you can then submit to CREST together with a FASTA file of the merged contigs.
Anders
Dear Anders,
Thank you for the clarification about the OTU table and your insights on RNA reads assembly! What I am hoping to do is just to get the taxonomic distribution (or relative abundance) of the microbial taxa at different levels across my metatranscriptomic samples (I have 6 in total).
I just have a couple of follow-up questions:
1) You mentioned that "You could also merge rRNA contigs across assemblies (e.g. samples), map your reads to this "SSU rRNA catalog" and create an OTU-table with resulting abundances across datasets". I understand that I could use mapping tools like bowtie2 or BWA to map my reads across samples to the "SSU rRNA catalog", but for the second part, which tools can I use to create an OTU table? I am thinking of something like the 'jgi_summarize_bam_contig_depths' function from the metabat2 software, which generates contig depths across samples (sort of like an OTU table). But I am not sure if this is what you meant.
2) "You can then submit to CREST in addition to a FASTA file with merged contigs". What would the hits file look like in this case if I were to provide it to CREST? Would it be a concatenated hits file where I merge the individual hits files for each of my samples? (I have run blast against silvamod138 for each of my samples individually.)
3) If I stick with these rRNA reads (as opposed to assemblies), is there any tool that can help me summarize the results in the CREST output file (i.e., assignment.txt), so I can get the relative abundance of each taxon across the taxonomic hierarchy?
4) Based on your experience, would there be a big discrepancy between results obtained using read- and assembly-based approaches?
Thank you very much, I really appreciate it!
Hi,
1) Yes. Once you have a set of SSU rRNA contigs (note that silvamod does not contain LSU sequences, which a "total RNA" metatranscriptome will also produce), you should be able to merge them using e.g. dedup or similar. Then you can map your read files to the merged set. I use the bash script below, which you can modify to fit your file names etc. Run it with all forward read files as arguments (in my version they have to end with "R1_001.fastq.gz"). The result should be one file per sample ending with "contig_abundances.txt". You can merge the columns of these files to produce an "OTU table".
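contigs=final.contigs.fa
threads=12
# Build the bowtie2 index (uncomment the bwa lines to use bwa instead)
#bwa index $contigs
bowtie2-build "$contigs" contigs
# Arguments: all forward read files, each ending in R1_001.fastq.gz
for read in "$@"; do
    # Strip the forward-read suffix to get the sample stub and the reverse-read file
    stub=${read%R1_001.fastq.gz}
    r2=${stub}R2_001.fastq.gz
    #bwa mem -t $threads ${contigs} $read > ${stub}.sam
    bowtie2 -x contigs -1 "$read" -2 "$r2" -S "${stub}.sam" -p "$threads"
    # Convert to BAM, then sort and index so that idxstats can be run
    samtools view -b -S "${stub}.sam" -o "${stub}.bam" -@ "$threads"
    samtools sort "${stub}.bam" -o "${stub}_sorted.bam" -@ "$threads"
    samtools index "${stub}_sorted.bam" -@ "$threads"
    # Per-contig mapped read counts for this sample
    samtools idxstats "${stub}_sorted.bam" > "${stub}_contig_abundances.txt"
done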
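The final "merge the columns" step could be sketched roughly as follows. The function name `merge_idxstats` is hypothetical. The sketch relies on the `samtools idxstats` layout (contig name, length, mapped read-segments, unmapped read-segments, tab-separated, with a final `*` line for unplaced reads) and takes sample names from the file names. The output layout (tab-separated, header starting with "OTU") is an assumption, so check it against what crest4 actually expects.

```shell
merge_idxstats() {
    # usage: merge_idxstats *_contig_abundances.txt > otu_table.tsv
    awk -F '\t' -v OFS='\t' '
        FNR == 1 {                                      # first line of each input file
            nfiles++
            name = FILENAME
            sub(/.*\//, "", name)                       # drop any directory part
            sub(/_*contig_abundances\.txt$/, "", name)  # drop the script suffix
            sample[nfiles] = name
        }
        $1 == "*" { next }                              # skip the unplaced-reads line
        {
            if (!($1 in seen)) { seen[$1] = 1; order[++ncontigs] = $1 }
            count[$1, nfiles] = $3                      # column 3: mapped read-segments
        }
        END {
            header = "OTU"
            for (f = 1; f <= nfiles; f++) header = header OFS sample[f]
            print header
            for (i = 1; i <= ncontigs; i++) {
                row = order[i]
                for (f = 1; f <= nfiles; f++) {
                    value = ((order[i], f) in count) ? count[order[i], f] : 0
                    row = row OFS value
                }
                print row
            }
        }
    ' "$@"
}
```

Contigs keep the order in which they first appear, and a contig missing from one sample's file gets a count of 0 for that sample.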
2-3) I attach two truncated example files of the output you get when you use an OTU table. otus_cumulative counts all assignments to the listed taxon and all of its children, i.e. lower taxonomic ranks within the same taxon. otus_by_rank only counts direct assignments: if Proteobacteria has 2 reads assigned for sample1, this means that 2 reads could be assigned to this phylum only, not counting reads assigned to lower ranks within Proteobacteria. This output is not returned when you use only individual reads, but to get it you could make a "mock table" with a header (e.g. "OTU, sample"), the name of each read in the first column, and the value 1 in the second.
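Both points can be sketched in a few lines of shell. These are illustrations under stated assumptions, not CREST's own code, and the function names `make_mock_otu_table` and `tally_assignments` are made up. The first builds the mock table from the read FASTA, assuming a tab-separated layout is acceptable. The second mimics the direct versus cumulative distinction on an assignments file, assuming each line holds a read name, a tab, and a "; "-separated lineage (or "No hits").

```shell
make_mock_otu_table() {
    # usage: make_mock_otu_table reads.fa > mock_otu_table.tsv
    awk -v OFS='\t' '
        BEGIN { print "OTU", "sample" }        # header as suggested above
        /^>/  { print substr($1, 2), 1 }       # read ID (text after ">"), abundance 1
    ' "$1"
}

tally_assignments() {
    # usage: tally_assignments assignments.txt > taxon_counts.tsv
    printf 'taxon\tdirect\tcumulative\n'
    awk -F '\t' -v OFS='\t' '
        NF < 2 || $2 == "No hits" { next }     # unassigned reads are not counted
        {
            n = split($2, rank, /; */)
            path = ""
            for (i = 1; i <= n; i++) {
                path = (i == 1 ? rank[i] : path "; " rank[i])
                cumulative[path]++             # every ancestor also gets the read
            }
            direct[path]++                     # only the deepest taxon gets it directly
        }
        END {
            for (taxon in cumulative) print taxon, direct[taxon] + 0, cumulative[taxon]
        }
    ' "$1" | LC_ALL=C sort
}
```

With one read assigned to "root; Bacteria" and another to "root; Bacteria; Proteobacteria", the line for "root; Bacteria" shows 1 direct and 2 cumulative assignments, which is the same distinction as between otus_by_rank and otus_cumulative.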
4) We did some comparisons. With assembly you get much better resolution, i.e. more assignments at genus and species rank, and fewer false positives. On the other hand you lose some rare taxa that have too few reads to be assembled. A further drawback is that it takes quite some work to run EMIRGE or MetaRib and map reads into an OTU table, but it is what I would choose. You can read more about it in our paper: https://pubmed.ncbi.nlm.nih.gov/32167532/
Thanks for your insights, I will give it a try and see how it goes!
Dear @xapple and @lanzen ,
Sorry for coming back here with another issue after several months. I am trying to classify some SSU rRNA short reads obtained from metatranscriptomic sequencing. Because the rRNA files are kind of large (~6GB), I split each file into 10 equal parts and ran them against the silvamod138 database with blastn, as suggested by @lanzen in issue #1.
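For reference, the splitting step described above could look roughly like the following. This is a sketch rather than the exact commands used: the function name `split_fasta` and the output naming (`PREFIX.1.fa`, `PREFIX.2.fa`, ...) are made up, and records are dealt out round-robin, so the parts are equal in record count (within one record), not necessarily in bytes.

```shell
split_fasta() {
    # usage: split_fasta INPUT.fa N PREFIX   -> writes PREFIX.1.fa ... PREFIX.N.fa
    awk -v n="$2" -v prefix="$3" '
        /^>/ { part = (count++ % n) + 1 }          # each new record goes to the next part
        part { print > (prefix "." part ".fa") }   # sequence lines follow their header
    ' "$1"
}
```

For example, `split_fasta sample.fa 10 sample.part` would produce ten files that can each be searched separately with blastn.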
I am using crest4, installed via pip in a Python virtual environment.
The hits file looks something like this:
When I provided this hits file to crest4, I received the following error message:
I thought it might be a format issue, so I removed the lines starting with "#" and used only this part:
I received the following message:
When I try to open the assignments.txt file, it's empty. Then I ran 'crest4 --pytest' to make sure the software was built successfully, but it actually failed one test. Could this be the reason why CREST is not working?
Sorry for this lengthy question.