zheminzhou / SPARSE

Strain Prediction and Analysis with Representative SEquence
https://www.biorxiv.org/content/biorxiv/early/2017/11/07/215707.full.pdf
GNU General Public License v3.0

Create custom database #6

Open jacodela opened 6 years ago

jacodela commented 6 years ago

I'm interested in mapping metagenome reads to genome bins that I previously assembled and that are not available in public databases. The documentation on creating a custom database is limited to subsetting the provided RefSeq representative database, given a genome with a known accession number, but I can't find how to create a truly custom database. Is this even possible?

palomo11 commented 5 years ago

I have the same question. @jacodela Did you figure out how to do it?

jacodela commented 5 years ago

Hi @palomo11, I never got an answer, nor did I figure out how to do it myself, so I used other tools. I would recommend taking a look at Bracken or (meta)Kallisto: they run quite fast and performed well in some tests I ran on synthetic communities. If you have NCBI taxIDs, go for Bracken; otherwise, check Kallisto.

jfy133 commented 3 years ago

@palomo11 @jacodela I know this is a very old thread to bring up, but given that the author doesn't seem to have replied, I'm leaving this as a possible answer:

I saw the following commands in the toy dataset:

```bash
#!/bin/bash
echo ':::: Creating an empty database with a name "toyset"'
sparse init --dbname toyset

echo ':::: Filling database "toyset" with 22 Salmonella complete genomes'
sparse index --dbname toyset --seqlist Salmonella_toyset.txt

echo ':::: Building a mapping database named "Salmonella" in "toyset"'
sparse query --dbname toyset --tag m==a | sparse mapDB --dbname toyset --mapDB Salmonella --seqlist stdin
```

The crucial thing, I think, is the --seqlist Salmonella_toyset.txt flag. This is simply the RefSeq TSV file you can download from the NCBI FTP; the toy example is here: https://github.com/zheminzhou/SPARSE/blob/master/example/Salmonella_toyset.txt

Presumably SPARSE reads this file to find the location and file name of each genome. I'm guessing you could 'fake' the info for 'custom' genomes, as long as your file follows the same column format as the RefSeq file.

Note that this is an assumption; I have not tried it myself.
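
To see which columns a 'faked' list would need to reproduce, it may help to print the fields of the example file's first line with their positions. This is only a rough sketch: `list_columns` is a hypothetical helper name, not part of SPARSE, and it assumes the file is tab-separated. If the first line turns out to be a data row rather than a header, the output still shows what each column holds.

```shell
#!/bin/bash
# Hypothetical helper (not part of SPARSE): print each tab-separated field
# of a file's first line, one per line, prefixed by its 1-based column number.
list_columns() {
    head -n 1 "$1" | tr '\t' '\n' | awk '{ printf "%d\t%s\n", NR, $0 }'
}
```

Running `list_columns Salmonella_toyset.txt` in the example directory should list the fields to mimic. A custom list with the same columns could then, in principle, be passed to `sparse index` through `--seqlist` in place of the toy file, though that part is untested.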

EDIT: Looking at the output, it does include NCBI taxonomy info (and the tool downloads the NCBI taxonomy dump). However, the clusters seem to be independent of this, so 'faking' the genomes might still work!