Multiple databases incorrectly listed in interface for multi-file dbs

pansapiens commented 13 years ago

Large pre-rolled BLAST databases supplied by NCBI are conventionally split into multiple files, but are usually treated as a single database by BLAST (eg nr.00.*, nr.01.*, nr.02.* etc. from ftp://ftp.ncbi.nlm.nih.gov/blast/db/).

In the SequenceServer interface, each FILE that the database is split over is listed as a database with the same title, rather than just once as expected. For the current 'nr' database, the interface lists the same title seven times (eg title is "All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects").

Running a BLAST search by selecting the first database in the list of duplicates appears to proceed as expected; however, the duplicate entries are likely to confuse users.

pansapiens commented 13 years ago

One solution would be to modify helpers.scan_blast_db to detect databases split over multiple files and only return a single database in this instance (eg by looking for duplicate titles, and maybe additionally looking at the blastdbcmd -list_outfmt "%d" date). It would need to be clearly documented that different databases require different titles, while databases split across multiple files should have the same title.
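A minimal sketch of the duplicate-title idea, in Ruby. The `Database` struct, the `collapse_by_title` helper and the sample entries are hypothetical and only for illustration; this is not SequenceServer's actual code, and it assumes the scan has already produced (name, title) pairs.

```ruby
# Hypothetical record for one entry found by the scan (not SequenceServer's real class).
Database = Struct.new(:name, :title)

# Collapse entries that share a title, keeping the first entry seen for each title.
# Assumes distinct databases carry distinct titles.
def collapse_by_title(databases)
  databases.uniq { |db| db.title }
end

# Made-up sample entries: two volumes sharing one title, plus an unrelated database.
entries = [
  Database.new('nr.00', 'All non-redundant GenBank CDS translations'),
  Database.new('nr.01', 'All non-redundant GenBank CDS translations'),
  Database.new('other_db', 'Some other database')
]

collapse_by_title(entries).each { |db| puts "#{db.name}\t#{db.title}" }
```

This only tidies the listing; which name should then be handed to BLAST for the search is a separate question.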

yannickwurm commented 13 years ago

Thanks Andrew. Indeed we hadn't tried using preformatted databases. Modifying helpers.scan_blast_db makes sense... yet another thing for the todolist!

cheers, yannick

wwood commented 13 years ago

Greetings, fellow Melbournite.

Thanks for the comment. I've not run across this issue myself but haven't tried it on any of the split up fasta files anyway. What you suggest sounds sensible, when we get a chance we'll try to implement it (or of course you are free to if you like). Thanks, ben

yeban commented 13 years ago

Does formatting your database like below help with the issue?

makeblastdb -in 'Sinvicta2-2-3.prot.subset.fasta test.fasta' -dbtype prot -out multi -title test

Sinvicta2-2-3.prot.subset.fasta and test.fasta are two databases formatted into one. (Actually they are the same file on my system but with different names.) I then get only one database entry, named test, in SequenceServer.

wwood commented 13 years ago

sorry help what?

yeban commented 13 years ago

@wwood writes: sorry help what?

Err, with the issue. What else? :).

wwood commented 13 years ago

Err, with the issue. What else? :).

How idiotic.

yeban commented 13 years ago

@wwood writes: How idiotic.

Ok :(. I thought it was evident. Anyway, I updated my comment to reflect the same.

yannickwurm commented 13 years ago

I think the issue is related to preformatted databases that one can download (eg: from NCBI).

pansapiens commented 13 years ago

Yes, the issue is with preformatted databases from NCBI and elsewhere, which are usually more convenient (and smaller downloads) than reformatting from the FASTA equivalent. The NCBI docs say they do this to keep each .tar.gz below a 1Gb download (I still think they should save some US tax dollars and offer torrents ... but that's another story).

The current (five) nr volumes can be reformatted as such (Update - doesn't work as expected, see next comment):

makeblastdb -in 'nr.00 nr.01 nr.02 nr.03 nr.04 nr.05' -dbtype prot -out nr_all -title nr -max_file_sz 100GB

I guess this is where I was heading with my original comment - should helpers.scan_blast_db do magic to deal with what is likely to be a common use case, or should the docs just help out local db administrators and tell them how to combine NCBI volumes ?

pansapiens commented 13 years ago

Update: My makeblastdb -in 'nr.00 nr.01 nr.02 nr.03 nr.04 nr.05' -dbtype prot -out nr_all -title nr -max_file_sz 100GB command eventually finished, however it does not produce a single volume database as expected ... it seems makeblastdb still limits the file size to ~2.1 Gb, such that the new db is still split across three volumes (and as a result is listed 4 times [?!]) in the web interface.

Unless I'm doing something wrong, it's looking like SequenceServer will need to handle this case internally, rather than pushing the responsibility onto the db admin. If this is a default behaviour of makeblastdb that cannot be overridden, it will eventually affect larger custom blast dbs in addition to the preformatted ones from NCBI.

wwood commented 13 years ago

it's looking like SequenceServer will need to handle this case internally

Yes, I agree. Temporarily, can you just click on one of the 4 listings and it works?

pansapiens commented 13 years ago

Yes, if I select the first database in the list of duplicates the search seems to work as expected, so this is not a showstopper bug, just a usability, cosmetic and user confidence issue.

wwood commented 13 years ago

Good to hear. Thanks

yannickwurm commented 13 years ago

Wait. I think there's a simple workaround:

Keep the downloaded files outside of sequenceserver's db dir.
Just put an ln -s alias to the "main" database file inside sequenceserver's db dir.
If that doesn't work, then copy the "main" file over to sequenceserver's db dir but change the paths within that file to reflect the true location of the .00 .01 .02... files

yeban commented 12 years ago

This issue also affects Dr. Wolfgang Rumpf (reported via email); see http://culture.no-ip.org:4567/.

yannickwurm commented 12 years ago

The following workaround should work.

Keep the downloaded files outside of sequenceserver's db dir.

Just put an ln -s alias to the "main" database file inside sequenceserver's db dir.

We could add autodetection of .00 files and spew out a bigass warning if that's the case.

wolfgangrumpf commented 12 years ago

I like the idea of letting SequenceServer handle this internally by filtering or whatever other means would work...it'll make setup easier.

wolfgangrumpf commented 12 years ago

Are we sure pointing at the first one really works? I tried this by selecting the first DB in the nt (non-redundant) series and got 78 results with a test query; repeated by selecting all of the nt DB's in the list (of that series) and got 250 hits, with no duplication I could see....

I was hitting http://culture.no-ip.org:4567/ and here is what I got:

selecting ONLY the first of "All GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS,environmental samples or phase 0, 1 or 2 HTGS sequences)" I got 78 matches
selecting ALL of "All GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS,environmental samples or phase 0, 1 or 2 HTGS sequences)" I got 250 matches.

I used this test sequence:

ACAAAACAAAAGCAACCTACATTGGTTTAACATGGGAAATCCTAGATAACAGACTTTTCCTGTTCCTGGG CAATAGCCTTCTGTGCTAAGATTGGTGACCTTACCTTGGTAATATTGTTGGTGCTTAGCCTTCCAGATCT AAAAACAGCTTAGTGTAATAAGTACCAAGCACAAAATTGTTAAGTTTCTCTTTTGATTGACTATGAAGAG GCTATCAAGCTTTTTTTTGTTTTGTTTTGTTTTTTGTTTTTTGAGACGGAGTCTCCCTTTGTTGCCCAGG CTGGAGTGCAGTGGCGACATCTCAGCTCACTGCAACCTCCGCCTCCCCGGTTCAACCGATTCTCCTGCCT TAGCCTCCCGAGTGGCTGGGATTACAGGCGGGTGCCACCACGCCCAGCTAATTTTTTGTATTTTCAGTAG AAACGGGGTTTCACCGTGTTAGCAAGGATGGTCTCGATCTCCTGACCTCGTGATCTGCCTGCCTCGGCCT CCCAAAGTGCTGGGATTACAGGCGTGCGCCACCACGCGCGGCTCAAGCTTTAAGTAGGTTTTTCATACAG TATTTTTTGTATTAGTTGGTGAATACAAATCTTCAATGAGCTCAGCAGAAATAAGAAATCCAAATTTCCT GATCTGCCCAGTGGTTTTATGAAACCCCAGTCCCGTTTTTTTGTTTTTTGTTTTTTTAAAAAATATTAGT AGTAGCTTTACTCAGTCTTTTCCACAAAAATAAGCTCACTGGTCAGCAAGCTCACCAGTGAGACTGGTGA GTGAGGCTGTACTTAGA

yannickwurm commented 12 years ago

Hi Wolfgang,

Apologies, I cannot try this out right now because my internet connection is too slow for downloading the precompiled dbs.

The important file is the '.nal' file (for nucleotide dbs; the protein equivalent should be .pal ). It will not automatically be the first one in the list - which would explain the discrepancy in your results.

Here for example are the complete contents of NCBI's refseq-rna .nal file:

#
# Alias file created: Dec 4, 2011  8:31 PM
#
TITLE  NCBI Transcript Reference Sequences
DBLIST refseq_rna.00 refseq_rna.01

Basically it's a small text file that contains the names of the files making up the complete database. Sequenceserver should be able to see only this '.nal' file. The .00, .01 etc files should not be in sequenceserver's directory.
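To make the structure of the alias file concrete, here is a rough Ruby sketch that pulls the TITLE and DBLIST entries out of such a file. `parse_alias` is a hypothetical helper, not part of SequenceServer or BLAST, and it only handles the two keys shown above (it also assumes the names in DBLIST are unquoted and separated by whitespace).

```ruby
# Parse the text of a BLAST alias file (.nal / .pal) into its title and volume list.
# Comment lines (starting with '#') and keys other than TITLE / DBLIST are ignored.
def parse_alias(text)
  info = { title: nil, dblist: [] }
  text.each_line do |line|
    line = line.strip
    next if line.empty? || line.start_with?('#')
    key, value = line.split(/\s+/, 2)
    case key
    when 'TITLE'  then info[:title] = value
    when 'DBLIST' then info[:dblist] = value.to_s.split(/\s+/)
    end
  end
  info
end

# The refseq_rna alias file quoted above, line by line.
sample = [
  '#',
  '# Alias file created: Dec 4, 2011  8:31 PM',
  '#',
  'TITLE  NCBI Transcript Reference Sequences',
  'DBLIST refseq_rna.00 refseq_rna.01'
].join("\n")

info = parse_alias(sample)
puts info[:title]
puts info[:dblist].inspect
```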

yannickwurm commented 12 years ago

Okay I have now managed to test this. Making a soft link via ln -s is insufficient. You need to:

Keep the directory with all the .00, .01 etc files somewhere, eg : ~/db/downloadedDb
Copy the ~/db/downloadedDb/bigDatabase.nal file to your sequenceserver database directory (eg: ~/seqservDbs/bigDatabase.nal)
Edit the bigDatabase.nal file by making the file paths absolute. This means for example prepending ~/db/downloadedDb/ to all files in the DBLIST line. For the above example you would end up with the following ~/seqservDbs/bigDatabase.nal file:
```
#
# Alias file created: Dec 4, 2011  8:31 PM
#
TITLE  NCBI Transcript Reference Sequences
DBLIST /Users/moi/db/downloadedDb/refseq_rna.00 /Users/moi/db/downloadedDb/refseq_rna.01
```

Then it works.
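For anyone who would rather not edit the file by hand, the path rewriting can be sketched in Ruby roughly as follows. `absolutize_dblist` is a hypothetical helper written for this comment; it assumes the names on the DBLIST line are relative, unquoted and separated by whitespace.

```ruby
# Return a copy of the alias file text in which every name on the DBLIST line
# is prefixed with volume_dir, the directory that really holds the .00, .01, ... files.
def absolutize_dblist(alias_text, volume_dir)
  alias_text.each_line.map do |line|
    if line.start_with?('DBLIST')
      names = line.split(/\s+/).drop(1)            # everything after the DBLIST keyword
      paths = names.map { |name| File.join(volume_dir, name) }
      "DBLIST #{paths.join(' ')}\n"
    else
      line                                         # comments and TITLE stay untouched
    end
  end.join
end

original = "TITLE  NCBI Transcript Reference Sequences\n" \
           "DBLIST refseq_rna.00 refseq_rna.01\n"
puts absolutize_dblist(original, '/Users/moi/db/downloadedDb')
```

The result would then be written to the copy of the alias file inside sequenceserver's database directory.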

wolfgangrumpf commented 12 years ago

Sounds more and more like SequenceServer should take care of this.. :)

wwood commented 12 years ago

Sounds more and more like SequenceServer should take care of this.. :)

+1

yannickwurm commented 12 years ago

Added to the FAQ, with an explanation of how to edit the .nal or .pal file (in progress).

wolfgangrumpf commented 12 years ago

Will the next version of SequenceServer address this? And while we're talking versions, how can I tell what version I have, and what the most recent version is?

yeban commented 12 years ago

@wolfgangrumpf writes:

Will the next version of SequenceServer address this?

It will surely be addressed in an upcoming release. But there are more important issues we want to address first, for example #14, #17, and #43 or #72. Basically, they deal with user interface improvements, and ease of deployment. I am sure you will love these new features, and probably even prefer that they be addressed first :), since the workaround @yannickwurm mentioned takes care of this issue for now.

A direct answer to your question is also complicated by our versioning policy:

And while we're talking versions, how can I tell what version I have, and what the most recent version is?

For now, we follow a rolling release. Simply put, there is no strict versioning, and the latest zipball/gem is usable. Ideally, you should upgrade as frequently as possible. Here is the rationale:

While SequenceServer has become quite a hit, and we are trying to pace up the development accordingly, the current release was intended to be testing/experimental, meant mostly for early adopters. This also explains the lack of polish. We intend to release 1.0-beta soon (next 10 days or so). And with some more testing, and feedback, make it 1.0 proper. 1.0 will be the first proper release of SequenceServer, and we are really looking forward to it :D.

The version numbers in the gem releases don't really mean anything. They are there because RubyGems doesn't allow you to use something like 'Beta' or 'Experimental'.

To completely answer your question, 1.0 onwards we will follow a strict versioning policy, and you will be able to find out the version number and probably other details as:

# NOTE: this doesn't work now
$ sequenceserver --version

Thanks a lot for your support!

xiy commented 12 years ago

Hello gents,

I've just committed a possible fix for this to my fork here: a5f57e26ed3094363f4d6cf8ef0d4c5568f27bc4

Please test it and let me know if it works for you, the main source change is in 'lib/sequenceserver/helper.rb'. I've tested it with the "nr" database from the link pansapiens gave and it works nicely (my poor old macbook suffered a long and arduous BLAST search...)

In essence, all it does is ignore the db files that match a regular expression. It should then recognise the alias file for the split databases (.pal or .nal files) and use that as a pointer to the database as a whole so all you need to do is make sure the parts are in the same directory as the alias file.

The fix relies on the filename standard NCBI use as described here:

Large databases are formatted in multiple one-gigabyte volumes, which are named using the database.##.tar.gz convention, with ## representing the volume number. All relevant volumes are required to reconstitute the database. An alias file, with .nal or .pal extension, is included in the 00 volume to tie all volumes together. The database can be called using the alias name without the extension. For example, to call nt database, simply use "-d nt" in the commandline without the quotes.

http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/blastdb.html#1

...so I haven't tested it with renamed or other non-NCBI-standard filenames (just a warning!).

pansapiens commented 12 years ago

Checked out your fork, this seems to work fine with the preformatted NCBI databases. Each database is now only listed once.

xiy commented 12 years ago

Checked out your fork, this seems to work fine with the preformatted NCBI databases. Each database is now only listed once.

Awesome - just committed a better, more accurate expression if you want to test that too ;]

wwood commented 12 years ago

Thanks a lot dudes. Just a heads up: @yeban won't be available to push this into master branch for a bit more than a week, if he is happy with it. I have to wonder though whether

line.match(/\w*.\d{2}.*/)

will also throw up some false positives. I think that the regexp can be made tighter - maybe

matches = line.match(/\w*\.\d{2}\.(...)$/) and %w(psq nsq etc etc.).include?(matches[1])

(untested). Do you have the extensions (e.g. nsq) that these files have handy?
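A runnable version of that idea might look like the sketch below. The extension list is an assumption on my part: it covers only the three core files of a protein database (phr, pin, psq) and of a nucleotide database (nhr, nin, nsq), and `volume_file?` is a hypothetical helper name.

```ruby
# Core BLAST database file extensions: protein (p..) and nucleotide (n..).
VOLUME_EXTENSIONS = %w[phr pin psq nhr nin nsq].freeze

# True when a file name looks like one volume of a split database, e.g. "nr.00.psq":
# a literal dot, two digits, another dot, then one of the known three-letter extensions.
def volume_file?(file_name)
  match = file_name.match(/\w*\.\d{2}\.(...)$/)
  !match.nil? && VOLUME_EXTENSIONS.include?(match[1])
end

%w[nr.00.psq nr.pal notes.12.txt nr70.00.phr].each do |name|
  puts "#{name}: #{volume_file?(name)}"
end
```

Checking the extension is what keeps an unrelated file such as notes.12.txt from being treated as a database volume.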

xiy commented 12 years ago

The previous commit did improve the accuracy quite a bit. Luckily, as the NCBI text above explains, alias files are essentially pointers that represent the entire multi-volume database, so you only need to feed BLAST the alias name, which is the db name with no extension or volume id. BLAST then reads the entire db as if it were a single volume. The regexp works by simply ignoring any databases returned by 'blastdbcmd -list' that have NCBI-format volume identifiers in their names. So for instance it returns nr, nr.00, nr.01 etc., and when the database objects are actually created the regexp filters out everything but the alias "nr" (in this example). See the link to Rubular I posted on the commit for more info, and if you want to improve it.

wwood commented 12 years ago

Ah, oops, somehow missed that second commit, stupidly (and I agree, it is better). I think we should add some logging info here to say that it has been ignored, as well.

pansapiens commented 12 years ago

Tested 8c069e214f1e362e797f119f8d436cf24ca4bbf3 ( line.match(/\/\w*[.]\d{2,}[.\w]*/) ), and it indeed works better. I also have some databases named nr70.00.p* and nr90.00.p* ... the previous regex would discard these, the new one doesn't. I agree with wwood: logging what's going on would be a good idea to make the magic more transparent.
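A small Ruby sketch of how the two patterns differ, as I understand it. The paths are made up, and it assumes the listing yields one database path per entry; the older pattern is the one quoted earlier in this thread.

```ruby
# Older pattern: the unescaped dots match any character, so any name containing
# two consecutive digits after at least one other character is caught.
OLD_PATTERN = /\w*.\d{2}.*/
# Newer pattern (8c069e2): a path separator, a name, a literal dot, then a volume number.
NEW_PATTERN = /\/\w*[.]\d{2,}[.\w]*/

# Hypothetical listing: two aliases (nr, nr70) and some of their volumes.
names = %w[/db/nr /db/nr.00 /db/nr.01 /db/nr70 /db/nr70.00]

# Names matching a pattern are treated as volumes and ignored; the rest are kept.
kept_old = names.reject { |name| name =~ OLD_PATTERN }
kept_new = names.reject { |name| name =~ NEW_PATTERN }

puts "old pattern keeps: #{kept_old.inspect}"
puts "new pattern keeps: #{kept_new.inspect}"
```

With these names the older pattern also throws away /db/nr70, because "70" looks like a volume number to it, while the newer one keeps both aliases and drops only the volumes.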

xiy commented 12 years ago

Good good.

I did initially have some logging for debug so I'll just add that back for my next commit.

xiy commented 12 years ago

Had to re-fork because GitHub messed up my repo (or maybe I did). Cleaner fix is now available here: 92012b737f74d52027d73f3116801d37a7aa7ae0.

yeban commented 12 years ago

This is weird. I tried downloading one of those massive BLAST databases from ftp://ftp.ncbi.nlm.nih.gov/blast/db/ twice (nr.0{0,6}.tar.gz). And each time I get a corrupt file -- the md5 checksums don't match. Any idea?
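For reference, the kind of check I mean can be sketched in Ruby as below. `md5_matches?` is a hypothetical helper, and it assumes the expected hex digest has already been obtained from wherever the checksum is published; the temporary file merely stands in for a downloaded volume.

```ruby
require 'digest'
require 'tempfile'

# Compare the MD5 digest of the file at `path` with an expected hex string.
def md5_matches?(path, expected_hex)
  Digest::MD5.file(path).hexdigest == expected_hex.strip.downcase
end

# Demonstration on a throwaway file rather than a real download.
Tempfile.create('volume') do |file|
  file.write('example bytes')
  file.flush
  expected = Digest::MD5.hexdigest('example bytes')
  puts md5_matches?(file.path, expected)
end
```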

wwood commented 12 years ago

Do they have different md5 sums?

wolfgangrumpf commented 12 years ago

Try again in a few hours...maybe you caught them in the midst of an update?

Cheers,

yeban commented 12 years ago

@wwood writes:

Do they have different md5 sums?

Yep, all of them. As in, every single part of the nr.0{0,6}.tar.gz multi-part BLAST database.

@wolfgangrumpf writes:

Try again in a few hours...maybe you caught them in the midst of an update?

Umm, maybe. But I tried once yesterday and again today. The net is a little slow on my side so it anyway takes ages to download anything huge.

pansapiens commented 12 years ago

Maybe try a local mirror at http://www.bio-mirror.net/ ?

Two mirrors also offer rsync, which should always give you a non-corrupted copy. If you're lucky, it might also repair what you already have:

eg, in the US: rsync rsync://bio-mirror.net/biomirror/blast/nr*

or from Australia: rsync rsync://biomirror.aarnet.edu.au/biomirror/biomirror/blast/nr*

If you are really stuck I can start seeding a torrent for you at biotorrents, which should also guarantee a non-corrupt copy.

yeban commented 12 years ago

@pansapiens writes:

Maybe try a local mirror at http://www.bio-mirror.net/ ?

Ok. I will give it a shot. Though I have a very strong feeling that my copy ends up corrupted because of some weirdness in how FTP works (especially through a proxy) and that my client and the NCBI server are not communicating properly.

Two mirrors also offer rsync which should always give you a non-corrupted copy [...] If you are really stuck I can start seeding a torrent for you at biotorrents, which should also guarantee a non-corrupt copy.

Wow! Thanks a ton for so many options. My college proxy bans rsync and torrent (in fact, everything other than http, https, and ftp). Torrent is further filtered by a shitty (Trend Micro) content filtering software on top. Of course, I have workarounds ;). But still, torrents should be our last option.

yeban commented 12 years ago

yeban writes:

@pansapiens writes:

Maybe try a local mirror at http://www.bio-mirror.net/ ?

Ok. I will give it a shot.

Didn't help. I tried first two parts of the multi-part BLAST database and ended up with the same corrupt file -- md5sum of the corrupted files match. :facepalm:

Anyway, I managed to download a non-corrupt copy of the nr.0{0,6}.tar.gz multi-part BLAST database by proxying the download through a different server that I have access to. Phew!

yeban commented 12 years ago

Pushed to next with some changes (SHA: 22689d97).

@xiy: Since I won't let vague comments or commit messages into my repo, I modified your patch a bit. Let me know if it is acceptable to you. I will push the changes to master and the gem release then (revert otherwise :-|).

@pansapiens: Cool with the attribution in the commit message? Since you are the one with different multi-part databases with varying names (hence more test cases), if it breaks in the future (regression or otherwise) we can call upon you for some help :).

May SequenceServer be with you!

pansapiens commented 12 years ago

No problems with attribution in the commit message. I can continue to test in the future as required.
