wurmlab / sequenceserver

Intuitive graphical web interface for running BLAST bioinformatics tool (i.e. have your own custom NCBI BLAST site!)
https://sequenceserver.com
GNU Affero General Public License v3.0

Protein domains in protein queries #693

Closed yannickwurm closed 10 months ago

yannickwurm commented 11 months ago

NCBI BLAST shows a CDD domain-hit analysis for protein queries. This is super useful and also biologically informative (e.g., "which functional part of my gene is conserved?").

Screenshot 2023-10-13 at 13 39 38

Those pictures come from an "rpsblast" alignment against precomputed protein domain matrices (README). The relevant output can be obtained using:

We could:

Screenshot 2023-10-13 at 13 52 17

yannickwurm commented 10 months ago

Hi @tadast - a bit more info here. Apologies for the delay.

No need to install anything special. Here I just downloaded Cdd.tar.gz and decompressed it. Note that it doesn't work if the path has unusual characters in it.

Example:

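# Build a small test query: the first 30 lines of the transcripts, as FASTA
seqtk seq -a ~/.sequenceserver/minidb/SI_putativeTranscripts.fasta | head -n 30 > test.fasta
# Search the query against the local Cdd database, writing commented tabular output (-outfmt 7)
rpstblastn -query test.fasta -db Cdd -outfmt 7 -num_threads 8 -evalue 1.0e-5 -max_target_seqs 10 > test.rpstblastn.cdd.tab
cat test.rpstblastn.cdd.tab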

Output:

# RPSTBLASTN 2.14.0+
# Query: SiJWA01AAW.scf
# Database: Cdd
# Fields: query acc.ver, subject acc.ver, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score
# 1 hits found
SiJWA01AAW.scf  CDD:436463  22.785  158 116 2   14  478 100 254 4.63e-24    95.2
# RPSTBLASTN 2.14.0+
# Query: SiJWA01AAX.scf
# Database: Cdd
# 0 hits found
# RPSTBLASTN 2.14.0+
# Query: SiJWA01ACE.scf
# Database: Cdd
# Fields: query acc.ver, subject acc.ver, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score
# 10 hits found
SiJWA01ACE.scf  CDD:395170  53.968  63  29  0   243 431 1   63  1.25e-24    91.0
SiJWA01ACE.scf  CDD:237664  43.243  74  42  0   231 452 1   74  1.31e-21    90.2
SiJWA01ACE.scf  CDD:237660  48.649  74  38  0   231 452 2   75  1.50e-21    90.3
SiJWA01ACE.scf  CDD:236757  50.794  63  31  0   243 431 5   67  5.32e-21    88.7
SiJWA01ACE.scf  CDD:223560  50.769  65  32  0   237 431 3   67  1.95e-20    87.3
SiJWA01ACE.scf  CDD:184599  47.826  69  36  0   243 449 6   74  2.73e-20    86.8

# and so on. 

Just like with normal BLAST, there are different -outfmt options, including JSON.

The q. start and q. end coordinates mark the regions of the query sequence we want to highlight (e.g. on the first image above, those would be ~250 to 600).
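
A rough sketch of how those coordinates could be pulled out of the -outfmt 7 table with awk. The function name highlight_regions is made up for illustration, and the sample rows are copied from the output above. It assumes the default 12-column layout listed in the "# Fields:" line, where q. start and q. end are columns 7 and 8. Since rpstblastn translates the nucleotide query, these coordinates count nucleotides of the query, not amino acids.

```shell
# Sample rows copied from the rpstblastn output above (whitespace-separated).
sample='# RPSTBLASTN 2.14.0+
# Query: SiJWA01AAW.scf
# Database: Cdd
# 1 hits found
SiJWA01AAW.scf  CDD:436463  22.785  158 116 2   14  478 100 254 4.63e-24    95.2
# RPSTBLASTN 2.14.0+
# Query: SiJWA01AAX.scf
# Database: Cdd
# 0 hits found
SiJWA01ACE.scf  CDD:395170  53.968  63  29  0   243 431 1   63  1.25e-24    91.0
SiJWA01ACE.scf  CDD:237664  43.243  74  42  0   231 452 1   74  1.31e-21    90.2'

# Hypothetical helper: read -outfmt 7 text on stdin, skip "#" comment lines,
# and print: query id, CDD accession, q. start, q. end.
highlight_regions() {
  awk '!/^#/ && NF == 12 { print $1, $2, $7, $8 }'
}

printf '%s\n' "$sample" | highlight_regions
# SiJWA01AAW.scf CDD:436463 14 478
# SiJWA01ACE.scf CDD:395170 243 431
# SiJWA01ACE.scf CDD:237664 231 452
```

On a real run the same function could read the file directly: highlight_regions < test.rpstblastn.cdd.tab. Queries with 0 hits simply produce no rows.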

The human-friendly description of the CDD domain is likely visible in the long table output, or in the JSON/XML outputs.

yannickwurm commented 10 months ago

Cloud users now have this. 🙌

Example: cdd-alignment-overview

And in this BLAST output:

blast-report-including-niemann-immune-domain-annotated copy