xapple / crest4

The `crest4` python package can automatically assign taxonomic names to DNA sequences obtained from environmental sequencing.
GNU General Public License v3.0

CREST4 produces empty assignment.txt file #2

Closed rzhan186 closed 1 year ago

rzhan186 commented 1 year ago

Dear @xapple and @lanzen ,

Sorry for coming back here with another issue after several months. I am trying to classify some SSU rRNA short reads obtained from metatranscriptomic sequencing. Because the rRNA files are kind of large (~6GB), I split each file into 10 equal parts and ran them against the silvamod138 database with Blastn, as suggested by @lanzen in issue #1 .
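
For reference, the splitting can be done along these lines (a sketch, not necessarily the exact command used; the file and part names are placeholders):

```shell
# Sketch: distribute FASTA records round-robin across 10 part files
awk -v parts=10 '/^>/{n++} {print > ("part_" (n % parts) ".fa")}' aligned.SSU.subsampled.fa
```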

I am using crest4 installed via pip in a Python virtual environment.

blastn \
-query $input/aligned.SSU.subsampled.part_001.fa \
-db ~/.crest4/silvamod138/silvamod138.fasta \
-num_alignments 10 \
-outfmt "7 qseqid sseqid bitscore length nident" \
-out $output/rRNA_silva138_part_001.hits \
-num_threads 32

The hits file looks something like this:

# BLASTN 2.13.0+
# Query: c_000000000001_hx1b_ssu_aligned_rna
# Database: /home/ruizhang/.crest4/silvamod138/silvamod138.fasta
# 0 hits found
# BLASTN 2.13.0+
# Query: c_000000000014_hx1b_ssu_aligned_rna
# Database: /home/ruizhang/.crest4/silvamod138/silvamod138.fasta
# 0 hits found
# BLASTN 2.13.0+
# Query: c_000000000032_hx1b_ssu_aligned_rna
# Database: /home/ruizhang/.crest4/silvamod138/silvamod138.fasta
# Fields: query id, subject id, bit score, alignment length, identical
# 5 hits found
c_000000000032_hx1b_ssu_aligned_rna     AOUI01017441    182     138     125
c_000000000032_hx1b_ssu_aligned_rna     AXCG01145984    176     138     124
c_000000000032_hx1b_ssu_aligned_rna     APFE030523816   152     131     115
c_000000000032_hx1b_ssu_aligned_rna     BDFN01002883    150     130     115
c_000000000032_hx1b_ssu_aligned_rna     CP027782        86.1    52      50

When I provide this hits file to crest4, I receive the following error message:

crest4 \
> --fasta $fasta/aligned.SSU.subsampled.fa \
> --search_hits test.hits \
> -o $output \
> --otu_table $output
Running crest4 v.4.2.7
Traceback (most recent call last):
  File "/project/6004066/ruizhang/software/anvio-7.1/bin/crest4", line 8, in <module>
    sys.exit(main())
  File "/project/6004066/ruizhang/software/anvio-7.1/lib/python3.8/site-packages/crest4/__main__.py", line 20, in main
    return magic()
  File "/project/6004066/ruizhang/software/anvio-7.1/lib/python3.8/site-packages/optmagic/__init__.py", line 250, in __call__
    return instance(*extra_args, **extra_kwargs)
  File "/project/6004066/ruizhang/software/anvio-7.1/lib/python3.8/site-packages/crest4/classify.py", line 350, in __call__
    self.out_file.writelines(query.tax_string for query in self.queries)
  File "/project/6004066/ruizhang/software/anvio-7.1/lib/python3.8/site-packages/plumbing/cache.py", line 86, in __get__
    else:                                      result = self.func(instance)
  File "/project/6004066/ruizhang/software/anvio-7.1/lib/python3.8/site-packages/crest4/classify.py", line 305, in queries
    result = [Query(self, query) for query in self.seqsearch.results]
  File "/project/6004066/ruizhang/software/anvio-7.1/lib/python3.8/site-packages/crest4/classify.py", line 305, in <listcomp>
    result = [Query(self, query) for query in self.seqsearch.results]
  File "/project/6004066/ruizhang/software/anvio-7.1/lib/python3.8/site-packages/seqsearch/search/blast.py", line 153, in results
    for entry in SearchIO.parse(handle, 'blast-tab', comments=True):
  File "/project/6004066/ruizhang/software/anvio-7.1/lib/python3.8/site-packages/Bio/SearchIO/__init__.py", line 306, in parse
    yield from generator
  File "/project/6004066/ruizhang/software/anvio-7.1/lib/python3.8/site-packages/Bio/SearchIO/BlastIO/blast_tab.py", line 234, in __iter__
    yield from iterfunc()
  File "/project/6004066/ruizhang/software/anvio-7.1/lib/python3.8/site-packages/Bio/SearchIO/BlastIO/blast_tab.py", line 271, in _parse_commented_qresult
    for qresult in qres_iter:
  File "/project/6004066/ruizhang/software/anvio-7.1/lib/python3.8/site-packages/Bio/SearchIO/BlastIO/blast_tab.py", line 406, in _parse_qresult
    cur = self._parse_result_row()
  File "/project/6004066/ruizhang/software/anvio-7.1/lib/python3.8/site-packages/Bio/SearchIO/BlastIO/blast_tab.py", line 332, in _parse_result_row
    raise ValueError(
ValueError: Expected 5 columns, found: 1

I thought it might be a format issue, so I removed the lines starting with "#" to keep only this part:

c_000000000032_hx1b_ssu_aligned_rna     AOUI01017441    182     138     125
c_000000000032_hx1b_ssu_aligned_rna     AXCG01145984    176     138     124
c_000000000032_hx1b_ssu_aligned_rna     APFE030523816   152     131     115
c_000000000032_hx1b_ssu_aligned_rna     BDFN01002883    150     130     115
c_000000000032_hx1b_ssu_aligned_rna     CP027782        86.1    52      50

I then received the following message:

Running crest4 v.4.3.0
/project/6004066/ruizhang/software/crest4/lib/python3.11/site-packages/Bio/SearchIO/_legacy/__init__.py:12: BiopythonDeprecationWarning: The 'Bio.SearchIO._legacy' module for parsing BLAST plain text output is deprecated and will be removed in a future release of Biopython. Consider generating your BLAST output for parsing as XML or tabular format instead.
  warnings.warn(
Classification ran successfully. Results are placed in '/scratch/ruizhang/crest4/test/assignments.txt'.

When I try to open the assignments.txt file, it's empty.

Then I ran 'crest4 --pytest' to make sure the software was built successfully, but it actually failed one test. Could this be the reason why CREST is not working?

crest4 --pytest
===================================== test session starts =====================================
platform linux -- Python 3.11.2, pytest-7.4.0, pluggy-1.2.0
rootdir: /project/6004066/ruizhang/software/crest4/lib/python3.11/site-packages/crest4/tests
configfile: pytest.ini
collected 15 items                                                                            

../../../../project/6004066/ruizhang/software/crest4/lib/python3.11/site-packages/crest4/tests/actualize_database/run_test.py . [  6%]
.

.                                                                                      [ 20%]
../../../../project/6004066/ruizhang/software/crest4/lib/python3.11/site-packages/crest4/tests/cmd_line_tool/run_test.py . [ 26%]
                                                                                        [ 26%]
../../../../project/6004066/ruizhang/software/crest4/lib/python3.11/site-packages/crest4/tests/custom_database/run_test.py . [ 33%]
                                                                                        [ 33%]
../../../../project/6004066/ruizhang/software/crest4/lib/python3.11/site-packages/crest4/tests/midori_test/run_test.py F [ 40%]
                                                                                        [ 40%]
../../../../project/6004066/ruizhang/software/crest4/lib/python3.11/site-packages/crest4/tests/ncbi_two_sequences/run_test.py . [ 46%]
                                                                                        [ 46%]
../../../../project/6004066/ruizhang/software/crest4/lib/python3.11/site-packages/crest4/tests/no_hits_sequence/run_test.py . [ 53%]
.                                                                                       [ 60%]
../../../../project/6004066/ruizhang/software/crest4/lib/python3.11/site-packages/crest4/tests/precomputed_hits/run_test.py . [ 66%]
                                                                                        [ 66%]
../../../../project/6004066/ruizhang/software/crest4/lib/python3.11/site-packages/crest4/tests/readme_example_seq/run_test.py . [ 73%]
                                                                                        [ 73%]
../../../../project/6004066/ruizhang/software/crest4/lib/python3.11/site-packages/crest4/tests/score_drop_change/run_test.py . [ 80%]
                                                                                        [ 80%]
../../../../project/6004066/ruizhang/software/crest4/lib/python3.11/site-packages/crest4/tests/vsearch_test/run_test.py . [ 86%]
                                                                                        [ 86%]
../../../../project/6004066/ruizhang/software/crest4/lib/python3.11/site-packages/crest4/tests/with_otu_table/run_test.py . [ 93%]
.                                                                                       [100%]

========================================== FAILURES ===========================================
_________________________________________ test_midori _________________________________________

    def test_midori():
        # The input fasta #
        fasta = this_dir.find('*.fasta')
        # The output directory #
        output_dir = this_dir + 'results/'
        output_dir.remove()
        # Create object #
        c = Classify(fasta       = fasta,
                     output_dir  = output_dir,
                     search_algo = 'vsearch',
                     search_db   = 'midori253darn',
                     num_threads = True)
        # Run it #
>       c()

/project/6004066/ruizhang/software/crest4/lib/python3.11/site-packages/crest4/tests/midori_test/run_test.py:37: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/project/6004066/ruizhang/software/crest4/lib/python3.11/site-packages/crest4/classify.py:351: in __call__
    self.out_file.writelines(query.tax_string for query in self.queries)
/project/6004066/ruizhang/software/crest4/lib/python3.11/site-packages/plumbing/cache.py:86: in __get__
    else:                                      result = self.func(instance)
/project/6004066/ruizhang/software/crest4/lib/python3.11/site-packages/crest4/classify.py:304: in queries
    if not self.search_hits: self.search()
/project/6004066/ruizhang/software/crest4/lib/python3.11/site-packages/crest4/classify.py:285: in search
    return self.seqsearch.run()
/project/6004066/ruizhang/software/crest4/lib/python3.11/site-packages/plumbing/cache.py:86: in __get__
    else:                                      result = self.func(instance)
/project/6004066/ruizhang/software/crest4/lib/python3.11/site-packages/crest4/classify.py:273: in seqsearch
    return SeqSearch(input_fasta = self.fasta,
/project/6004066/ruizhang/software/crest4/lib/python3.11/site-packages/seqsearch/search/__init__.py:81: in __init__
    self.set_defaults()
/project/6004066/ruizhang/software/crest4/lib/python3.11/site-packages/seqsearch/search/__init__.py:93: in set_defaults
    if self.algorithm == 'vsearch' and hasattr(self.database, 'vsearch_db'):
/project/6004066/ruizhang/software/crest4/lib/python3.11/site-packages/plumbing/cache.py:86: in __get__
    else:                                      result = self.func(instance)
/project/6004066/ruizhang/software/crest4/lib/python3.11/site-packages/crest4/databases.py:213: in vsearch_db
    if not self.downloaded: self.download()
/project/6004066/ruizhang/software/crest4/lib/python3.11/site-packages/crest4/databases.py:171: in download
    download_from_url(self.url,
/project/6004066/ruizhang/software/crest4/lib/python3.11/site-packages/plumbing/scraping/__init__.py:82: in download_from_url
    if stream: response = request(url, header, response=True, stream=True, **kwargs)
/project/6004066/ruizhang/software/crest4/lib/python3.11/site-packages/decorator.py:232: in fun
    return caller(func, *(extras + args), **kw)
/project/6004066/ruizhang/software/crest4/lib/python3.11/site-packages/retry/api.py:73: in retry_decorator
    return __retry_internal(partial(f, *args, **kwargs), exceptions, tries, delay, max_delay, backoff, jitter,
/project/6004066/ruizhang/software/crest4/lib/python3.11/site-packages/retry/api.py:33: in __retry_internal
    return f()
/project/6004066/ruizhang/software/crest4/lib/python3.11/site-packages/plumbing/scraping/__init__.py:35: in request
    resp.raise_for_status()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <Response [403]>

    def raise_for_status(self):
        """Raises :class:`HTTPError`, if one occurred."""

        http_error_msg = ""
        if isinstance(self.reason, bytes):
            # We attempt to decode utf-8 first because some servers
            # choose to localize their reason strings. If the string
            # isn't utf-8, we fall back to iso-8859-1 for all other
            # encodings. (See PR #3538)
            try:
                reason = self.reason.decode("utf-8")
            except UnicodeDecodeError:
                reason = self.reason.decode("iso-8859-1")
        else:
            reason = self.reason

        if 400 <= self.status_code < 500:
            http_error_msg = (
                f"{self.status_code} Client Error: {reason} for url: {self.url}"
            )

        elif 500 <= self.status_code < 600:
            http_error_msg = (
                f"{self.status_code} Server Error: {reason} for url: {self.url}"
            )

        if http_error_msg:
>           raise HTTPError(http_error_msg, response=self)
E           requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://crest4.s3-eu-west-1.amazonaws.com/midori253darn.tar.gz

/project/6004066/ruizhang/software/crest4/lib/python3.11/site-packages/requests/models.py:1021: HTTPError
------------------------------------ Captured stdout call -------------------------------------
Running crest4 v.4.3.0

          ╭───────────────────── Large Download ─────────────────────╮          
          │                                                          │          
          │                                                          │          
          │  The database 'midori253darn' has not been downloaded    │          
          │  yet. This process will start now and might take some    │          
          │  time depending on your internet connection. Please be   │          
          │  patient. The result will be saved to                    │          
          │  '/home/ruizhang/.crest4/'. You can override this by     │          
          │  setting the $CREST4_DIR environment variable.           │          
          │                                                          │          
          │                                                          │          
          ╰──────────────────────────────────────────────────────────╯          

-------------------------------------- Captured log call --------------------------------------
WARNING  retry.api:api.py:40 403 Client Error: Forbidden for url: https://crest4.s3-eu-west-1.amazonaws.com/midori253darn.tar.gz, retrying in 1 seconds...
WARNING  retry.api:api.py:40 403 Client Error: Forbidden for url: https://crest4.s3-eu-west-1.amazonaws.com/midori253darn.tar.gz, retrying in 2 seconds...
WARNING  retry.api:api.py:40 403 Client Error: Forbidden for url: https://crest4.s3-eu-west-1.amazonaws.com/midori253darn.tar.gz, retrying in 4 seconds...
WARNING  retry.api:api.py:40 403 Client Error: Forbidden for url: https://crest4.s3-eu-west-1.amazonaws.com/midori253darn.tar.gz, retrying in 8 seconds...
WARNING  retry.api:api.py:40 403 Client Error: Forbidden for url: https://crest4.s3-eu-west-1.amazonaws.com/midori253darn.tar.gz, retrying in 16 seconds...
WARNING  retry.api:api.py:40 403 Client Error: Forbidden for url: https://crest4.s3-eu-west-1.amazonaws.com/midori253darn.tar.gz, retrying in 32 seconds...
WARNING  retry.api:api.py:40 403 Client Error: Forbidden for url: https://crest4.s3-eu-west-1.amazonaws.com/midori253darn.tar.gz, retrying in 64 seconds...
====================================== warnings summary =======================================
cmd_line_tool/run_test.py::test_cmd_line_tool
  /project/6004066/ruizhang/software/crest4/lib/python3.11/site-packages/_pytest/python.py:198: PytestReturnNotNoneWarning: Expected None, but cmd_line_tool/run_test.py::test_cmd_line_tool returned "Running crest4 v.4.3.0\nClassification ran successfully. Results are placed in '/project/6004066/ruizhang/software/crest4/lib/python3.11/site-packages/crest4/tests/cmd_line_tool/results/assignments.txt'.\n", which will be an error in a future version of pytest.  Did you mean to use `assert` instead of `return`?
    warnings.warn(

custom_database/run_test.py::test_custom_database
  /project/6004066/ruizhang/software/crest4/lib/python3.11/site-packages/Bio/SearchIO/_legacy/__init__.py:12: BiopythonDeprecationWarning: The 'Bio.SearchIO._legacy' module for parsing BLAST plain text output is deprecated and will be removed in a future release of Biopython. Consider generating your BLAST output for parsing as XML or tabular format instead.
    warnings.warn(

custom_database/run_test.py::test_custom_database
  /project/6004066/ruizhang/software/crest4/lib/python3.11/site-packages/ete3/webplugin/webapp.py:44: DeprecationWarning: 'cgi' is deprecated and slated for removal in Python 3.13
    import cgi

custom_database/run_test.py::test_custom_database
  /project/6004066/ruizhang/software/crest4/lib/python3.11/site-packages/_pytest/python.py:198: PytestReturnNotNoneWarning: Expected None, but custom_database/run_test.py::test_custom_database returned <Classify object on '/project/6004066/ruizhang/software/crest4/lib/python3.11/site-packages/crest4/tests/custom_database/two_seqs.fasta'>, which will be an error in a future version of pytest.  Did you mean to use `assert` instead of `return`?
    warnings.warn(

ncbi_two_sequences/run_test.py::test_ncbi_two_seqs
  /project/6004066/ruizhang/software/crest4/lib/python3.11/site-packages/_pytest/python.py:198: PytestReturnNotNoneWarning: Expected None, but ncbi_two_sequences/run_test.py::test_ncbi_two_seqs returned <Classify object on '/project/6004066/ruizhang/software/crest4/lib/python3.11/site-packages/crest4/tests/ncbi_two_sequences/two_seqs.fasta'>, which will be an error in a future version of pytest.  Did you mean to use `assert` instead of `return`?
    warnings.warn(

no_hits_sequence/run_test.py::test_no_hits_blast
  /project/6004066/ruizhang/software/crest4/lib/python3.11/site-packages/_pytest/python.py:198: PytestReturnNotNoneWarning: Expected None, but no_hits_sequence/run_test.py::test_no_hits_blast returned <Classify object on '/project/6004066/ruizhang/software/crest4/lib/python3.11/site-packages/crest4/tests/no_hits_sequence/no_hits.fasta'>, which will be an error in a future version of pytest.  Did you mean to use `assert` instead of `return`?
    warnings.warn(

no_hits_sequence/run_test.py::test_no_hits_vsearch
  /project/6004066/ruizhang/software/crest4/lib/python3.11/site-packages/_pytest/python.py:198: PytestReturnNotNoneWarning: Expected None, but no_hits_sequence/run_test.py::test_no_hits_vsearch returned <Classify object on '/project/6004066/ruizhang/software/crest4/lib/python3.11/site-packages/crest4/tests/no_hits_sequence/no_hits.fasta'>, which will be an error in a future version of pytest.  Did you mean to use `assert` instead of `return`?
    warnings.warn(

precomputed_hits/run_test.py::test_precomputed_hits
  /project/6004066/ruizhang/software/crest4/lib/python3.11/site-packages/_pytest/python.py:198: PytestReturnNotNoneWarning: Expected None, but precomputed_hits/run_test.py::test_precomputed_hits returned <Classify object on '/project/6004066/ruizhang/software/crest4/lib/python3.11/site-packages/crest4/tests/precomputed_hits/two_seqs.fasta'>, which will be an error in a future version of pytest.  Did you mean to use `assert` instead of `return`?
    warnings.warn(

readme_example_seq/run_test.py::test_readme_example_seq
  /project/6004066/ruizhang/software/crest4/lib/python3.11/site-packages/_pytest/python.py:198: PytestReturnNotNoneWarning: Expected None, but readme_example_seq/run_test.py::test_readme_example_seq returned <Classify object on '/project/6004066/ruizhang/software/crest4/lib/python3.11/site-packages/crest4/tests/readme_example_seq/readme_test.fasta'>, which will be an error in a future version of pytest.  Did you mean to use `assert` instead of `return`?
    warnings.warn(

score_drop_change/run_test.py::test_score_drop_change
  /project/6004066/ruizhang/software/crest4/lib/python3.11/site-packages/_pytest/python.py:198: PytestReturnNotNoneWarning: Expected None, but score_drop_change/run_test.py::test_score_drop_change returned "Running crest4 v.4.3.0\nClassification ran successfully. Results are placed in '/project/6004066/ruizhang/software/crest4/lib/python3.11/site-packages/crest4/tests/score_drop_change/results/assignments.txt'.\n", which will be an error in a future version of pytest.  Did you mean to use `assert` instead of `return`?
    warnings.warn(

vsearch_test/run_test.py::test_vsearch
  /project/6004066/ruizhang/software/crest4/lib/python3.11/site-packages/_pytest/python.py:198: PytestReturnNotNoneWarning: Expected None, but vsearch_test/run_test.py::test_vsearch returned <Classify object on '/project/6004066/ruizhang/software/crest4/lib/python3.11/site-packages/crest4/tests/vsearch_test/two_seqs.fasta'>, which will be an error in a future version of pytest.  Did you mean to use `assert` instead of `return`?
    warnings.warn(

with_otu_table/run_test.py::test_with_good_otu_table
  /project/6004066/ruizhang/software/crest4/lib/python3.11/site-packages/_pytest/python.py:198: PytestReturnNotNoneWarning: Expected None, but with_otu_table/run_test.py::test_with_good_otu_table returned <Classify object on '/project/6004066/ruizhang/software/crest4/lib/python3.11/site-packages/crest4/tests/with_otu_table/seqs.fasta'>, which will be an error in a future version of pytest.  Did you mean to use `assert` instead of `return`?
    warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=================================== short test summary info ===================================
FAILED ../../../../project/6004066/ruizhang/software/crest4/lib/python3.11/site-packages/crest4/tests/midori_test/run_test.py::test_midori - requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://crest4.s3-eu-w...
==================== 1 failed, 14 passed, 12 warnings in 498.09s (0:08:18) ====================

Sorry for this lengthy question.

xapple commented 1 year ago

The failing test concerns the midori database, which is in the process of being updated, so that failure is expected. The test concerning precomputed hits is passing.

Looking at the blast hits file you provided in the example, it contains spaces instead of tabs. Could it simply be that your text editor silently converted all tabs to spaces?

[Screenshot attached: the hits file, 2023-07-12]

It's fine to leave the comments in. Lines starting with '#' are ignored by the Biopython parser.
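
If an editor did convert the tabs, one way to restore them is a quick sed pass (a sketch, assuming GNU sed and that no field itself contains spaces; `test.hits` is the file name from your command):

```shell
# GNU sed: collapse each run of spaces into a single tab
sed -E 's/ +/\t/g' test.hits > test.tabs.hits
```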

rzhan186 commented 1 year ago

Hi @xapple ,

Thank you for your prompt reply! CREST worked after replacing the spaces with tabs! I got something like this:

c_000000000014_hx1b_ssu_aligned_rna     No hits
c_000000000032_hx1b_ssu_aligned_rna     root; Main genome; Eukaryota; Archaeplastida; Chloroplastida; Embryophyta; eudicotyledons; Brassicales; Cleomaceae; Tarenaya; Tarenaya hassleriana
c_000000000050_hx1b_ssu_aligned_rna     root; Main genome; Eukaryota; Archaeplastida; Chloroplastida; Embryophyta

However, I'd also like CREST to create OTU tables. When I run this:

crest4 \
--fasta $fasta/aligned.SSU.subsampled.fa \
--search_hits $SCRATCH/test.hits \
-o $output \
--otu_table $output

I got this error:

             ^^^^^
IsADirectoryError: [Errno 21] Is a directory: '/scratch/ruizhang/crest4/test/'

I thought I had to create the file manually so CREST could write into it; however, this didn't work either.

crest4 \
--fasta $fasta/aligned.${site}.SSU.subsampled.fa \
--search_hits $SCRATCH/test.hits \
-o $output \
--otu_table $output/table.tsv

Could you also help me troubleshoot this, please?

lanzen commented 1 year ago

Hi Rui,

This is failing because CREST cannot generate OTU output files. The --otu_table option expects an input file (not an output path) with the estimated abundance of each OTU (or sequence variant) across different samples or datasets. It then uses this to calculate taxonomic abundances across datasets, and also returns an OTU table with an added column representing the taxonomic annotation of each OTU.

I am not sure what you want to do or how you want to generate an OTU table from only the sequence input, but using only a FASTA-formatted file as input, CREST cannot determine which sample each sequence read comes from, and there is no function for this.
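
For illustration, the expected --otu_table input is a tab-separated abundance table along these lines (the OTU and sample names here are made up):

```
OTU	sample1	sample2	sample3
otu_001	12	0	3
otu_002	5	8	1
```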

One thing you could do is to use an assembly tool adapted to rRNA sequences, like MetaRib (https://github.com/yxxue/MetaRib) or EMIRGE (https://github.com/csmiller/EMIRGE). The latter is easier to install and simpler, but would probably only work for input files of <10 million reads or so. Using assembled rRNA sequences would then improve your taxonomic assignments and decrease the size of your dataset. You could also merge rRNA contigs across assemblies (e.g. samples), map your reads to this "SSU rRNA catalog", and create an OTU table with the resulting abundances across datasets, which you can then submit to CREST together with a FASTA file of the merged contigs.

Anders

rzhan186 commented 1 year ago

Dear Anders,

Thank you for the clarification about the OTU table and your insights on RNA reads assembly! What I am hoping to do is just to get the taxonomic distribution (or relative abundance) of the microbial taxa at different levels across my metatranscriptomic samples (I have 6 in total).

I just have a couple of follow-up questions:

  1. You mentioned that "You could also merge rRNA contigs across assemblies (e.g. samples), map your reads to this "SSU rRNA catalog" and create an OTU-table with resulting abundances across datasets". I understand that I could use mapping tools like bowtie2 or BWA to map my reads across samples to the “SSU rRNA catalog”, but for the second part, which tools can I use to create an OTU table? I am thinking of something like the ‘jgi_summarize_bam_contig_depths’ function from the metabat2 software, which generates contigs depth across samples (sort of like an OTU table). But I am not sure if this is what you meant.

  2. You wrote: "You can then submit to CREST in addition to a FASTA file with merged contigs". What would the hits file look like in this case if I were to provide it to CREST? Would it be a concatenated hits file where I merge the individual hits files for each of my samples (I have run blast against silvamod138 for each of my samples individually)?

  3. If I stick with these rRNA reads (as opposed to assemblies), is there any tool that can help me summarize the results in the CREST output file (i.e., assignments.txt), so I can get the relative abundance of each taxon across the taxonomic hierarchy?

  4. Based on your experience, would there be a big discrepancy between results obtained using read- and assembly-based approaches?

Thank you very much, I really appreciate it!

lanzen commented 1 year ago

Hi,

1) Yes, after you have a set of SSU rRNA contigs (note that silvamod does not contain LSU sequences, which a "total RNA" metatranscriptome will also produce), you should be able to merge them using e.g. dedup or similar. Then you could map your read files to it. I use the bash script below, which you can modify to fit your file names etc. You run it with all forward read files (which have to end with "R1_001.fastq.gz" in my version). The result is a file for each sample ending with "contig_abundances.txt". You can merge the columns of these files to produce an "OTU table".

contigs=final.contigs.fa
threads=12

# Build the mapping index (bwa alternative left commented out)
#bwa index $contigs
bowtie2-build $contigs contigs

for read in $*; do
    # Derive the sample name and the matching reverse-read file
    stub=${read//R1_001.fastq.gz}
    r2=${stub}R2_001.fastq.gz
    #bwa mem -t $threads ${contigs} $read > ${stub}.sam
    # Map the read pair against the contig index
    bowtie2 -x contigs -1 $read -2 $r2 -S ${stub}.sam -p $threads
    # Convert, sort and index the alignments
    samtools view -b -S ${stub}.sam -o ${stub}.bam -@ $threads
    samtools sort ${stub}.bam -o ${stub}_sorted.bam -@ $threads
    samtools index ${stub}_sorted.bam -@ $threads
    # Per-contig read counts (columns: contig, length, mapped, unmapped)
    samtools idxstats ${stub}_sorted.bam > ${stub}_contig_abundances.txt
done
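
Merging the per-sample files into one table could then look like this (a minimal sketch, assuming all idxstats files list the contigs in the same order; the sample file names are hypothetical):

```shell
# idxstats columns are: contig, length, mapped reads, unmapped reads;
# keep the contig names once, plus the mapped-read count of each sample
cut -f1 sample1_contig_abundances.txt > contigs.col
cut -f3 sample1_contig_abundances.txt > sample1.col
cut -f3 sample2_contig_abundances.txt > sample2.col
paste contigs.col sample1.col sample2.col > otu_counts.tsv
```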

2-3) I attach two truncated example files of the output that is given when you use an OTU table. otus_cumulative counts all assignments to the listed taxon and all its children, i.e. lower taxonomic ranks within the same taxon. otus_by_rank only counts direct assignments, i.e. if Proteobacteria has 2 reads assigned for sample1, this means that 2 reads could be assigned to this phylum only, not counting other reads assigned to lower proteobacterial ranks. However, this output is not returned when you use only individual reads, but you could make a "mock table" if you want, with a header (e.g. "OTU", "sample"), the name of each read in the first column, and the value 1 in the second, to get this output.
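
Generating such a mock table from the input FASTA could be done along these lines (a sketch; `reads.fasta`, `sample1` and `mock_table.tsv` are hypothetical names):

```shell
# One row per FASTA read, with a constant abundance of 1 for a single sample
printf 'OTU\tsample1\n' > mock_table.tsv
grep '^>' reads.fasta | sed 's/^>//' | awk '{print $1 "\t1"}' >> mock_table.tsv
```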

4) We did some comparisons: with assembly you get much better resolution, i.e. assignments to genus and species rank, and fewer false positives, but on the other hand you lose some rare taxa that have too few reads to be assembled. A disadvantage is that it is quite some work to run EMIRGE or MetaRib and map reads into an OTU table, but it is what I would choose. You can read more about it in our paper: https://pubmed.ncbi.nlm.nih.gov/32167532/

otus_cumulative.csv otus_by_rank.csv

rzhan186 commented 1 year ago

Thanks for your insights, I will give it a try and see how it goes!