Closed: pansapiens closed this issue 12 years ago
One solution would be to modify helpers.scan_blast_db to detect databases split over multiple files and return only a single entry in this case (eg by looking for duplicate titles, and perhaps additionally comparing the blastdbcmd -list_outfmt "%d" date). It would need to be clearly documented that distinct databases require different titles, while databases split across multiple files should share the same title.
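A minimal sketch of the duplicate-title idea, using a hypothetical Database struct and method name rather than SequenceServer's actual internals:

```ruby
# Illustrative only -- Database, its fields, and dedupe_by_title are
# hypothetical stand-ins, not SequenceServer's real API.
Database = Struct.new(:name, :title)

# Volumes of a split database share a title, so keeping the first entry
# per title collapses them into a single listing.
def dedupe_by_title(databases)
  databases.uniq(&:title)
end

dbs = [
  Database.new('nr.00', 'All non-redundant GenBank CDS translations'),
  Database.new('nr.01', 'All non-redundant GenBank CDS translations'),
  Database.new('swissprot', 'UniProtKB/Swiss-Prot sequences'),
]
dedupe_by_title(dbs).map(&:name)  # => ["nr.00", "swissprot"]
```

The date comparison mentioned above could be added as a tiebreaker where two genuinely different databases were accidentally given the same title.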
Thanks Andrew. Indeed we hadn't tried using preformatted databases. Modifying helpers.scan_blast_db makes sense... yet another thing for the todolist!
cheers, yannick
Greetings, fellow Melbournite.
Thanks for the comment. I've not run across this issue myself, but then I haven't tried it on any of the split-up FASTA files anyway. What you suggest sounds sensible; when we get a chance we'll try to implement it (or of course you are free to, if you like). Thanks, ben
Does formatting your database like below help with the issue?
makeblastdb -in 'Sinvicta2-2-3.prot.subset.fasta test.fasta' -dbtype prot -out multi -title test
Sinvicta2-2-3.prot.subset.fasta and test.fasta are two databases formatted into one. (Actually they are the same file on my system, but with different names.) I then get only one database entry, named test, in SequenceServer.
sorry help what?
@wwood writes: sorry help what?
Err, with the issue. What else? :).
How idiotic.
@wwood writes: How idiotic.
Ok :(. I thought it was evident. Anyway, I updated my comment to reflect the same.
I think the issue is related to preformatted databases that one can download (eg from NCBI).
Yes, the issue is with preformatted databases from NCBI and elsewhere, which are usually more convenient (and smaller downloads) than reformatting from the FASTA equivalent. The NCBI docs say they do this to keep each .tar.gz below 1 GB (I still think they should save some US tax dollars and offer torrents ... but that's another story).
The current (six) nr volumes can be reformatted as such (Update: doesn't work as expected, see next comment):
makeblastdb -in 'nr.00 nr.01 nr.02 nr.03 nr.04 nr.05' -dbtype prot -out nr_all -title nr -max_file_sz 100GB
I guess this is where I was heading with my original comment - should helpers.scan_blast_db do magic to deal with what is likely to be a common use case, or should the docs just help out local db administrators and tell them how to combine NCBI volumes ?
Update: My makeblastdb -in 'nr.00 nr.01 nr.02 nr.03 nr.04 nr.05' -dbtype prot -out nr_all -title nr -max_file_sz 100GB
command eventually finished; however, it does not produce a single-volume database as expected ... it seems makeblastdb still limits the file size to ~2.1 GB, such that the new db is still split across three volumes (and as a result is listed 4 times [?!] in the web interface).
Unless I'm doing something wrong, it's looking like SequenceServer will need to handle this case internally, rather than pushing the responsibility onto the db admin. If this is a default behaviour of makeblastdb that cannot be overridden, it will eventually affect larger custom BLAST dbs in addition to the preformatted ones from NCBI.
it's looking like SequenceServer will need to handle this case internally
Yes, I agree. Temporarily, can you just click on one of the 4 listings and it works?
Yes, if I select the first database in the list of duplicates the search seems to work as expected, so this is not a showstopper bug, just a usability, cosmetic and user confidence issue.
Good to hear. Thanks
Wait. I think there's a simple workaround. This issue also affects Dr. Wolfgang Rumpf (reported via email); see http://culture.no-ip.org:4567/.
The following workaround should work:
- Keep the downloaded files outside of sequenceserver's db dir.
- Just put an ln -s alias to the "main" database file inside sequenceserver's db dir.
We could add autodetection of .00 files and spew out a bigass warning if that's the case.
I like the idea of letting SequenceServer handle this internally, by filtering or whatever other means would work ... it'll make setup easier.
Are we sure pointing at the first one really works? I tried this by selecting the first DB in the nt (non-redundant) series and got 78 results with a test query; I repeated it by selecting all of the nt DBs in that series and got 250 hits, with no duplication I could see ...
I was hitting http://culture.no-ip.org:4567/ and here is what I got:
selecting ALL of "All GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, environmental samples or phase 0, 1 or 2 HTGS sequences)" I got 250 matches.
I used this test sequence:
ACAAAACAAAAGCAACCTACATTGGTTTAACATGGGAAATCCTAGATAACAGACTTTTCCTGTTCCTGGG CAATAGCCTTCTGTGCTAAGATTGGTGACCTTACCTTGGTAATATTGTTGGTGCTTAGCCTTCCAGATCT AAAAACAGCTTAGTGTAATAAGTACCAAGCACAAAATTGTTAAGTTTCTCTTTTGATTGACTATGAAGAG GCTATCAAGCTTTTTTTTGTTTTGTTTTGTTTTTTGTTTTTTGAGACGGAGTCTCCCTTTGTTGCCCAGG CTGGAGTGCAGTGGCGACATCTCAGCTCACTGCAACCTCCGCCTCCCCGGTTCAACCGATTCTCCTGCCT TAGCCTCCCGAGTGGCTGGGATTACAGGCGGGTGCCACCACGCCCAGCTAATTTTTTGTATTTTCAGTAG AAACGGGGTTTCACCGTGTTAGCAAGGATGGTCTCGATCTCCTGACCTCGTGATCTGCCTGCCTCGGCCT CCCAAAGTGCTGGGATTACAGGCGTGCGCCACCACGCGCGGCTCAAGCTTTAAGTAGGTTTTTCATACAG TATTTTTTGTATTAGTTGGTGAATACAAATCTTCAATGAGCTCAGCAGAAATAAGAAATCCAAATTTCCT GATCTGCCCAGTGGTTTTATGAAACCCCAGTCCCGTTTTTTTGTTTTTTGTTTTTTTAAAAAATATTAGT AGTAGCTTTACTCAGTCTTTTCCACAAAAATAAGCTCACTGGTCAGCAAGCTCACCAGTGAGACTGGTGA GTGAGGCTGTACTTAGA
Hi Wolfgang,
apologies, I cannot try this out right now because my internet connection is too slow for downloading the precompiled dbs.
The important file is the '.nal' file (for nucleotide dbs; the protein equivalent should be .pal). It will not automatically be the first one in the list, which would explain the discrepancy in your results.
Here for example are the complete contents of NCBI's refseq-rna .nal file:
#
# Alias file created: Dec 4, 2011 8:31 PM
#
TITLE NCBI Transcript Reference Sequences
DBLIST refseq_rna.00 refseq_rna.01
Basically it's a small text file that contains the names of the files making up the complete database. SequenceServer should be able to see only this '.nal' file. The .00, .01 etc files should not be in sequenceserver's directory.
Okay I have now managed to test this. Making a soft link via ln -s is insufficient. You need to:
- Keep the .00, .01 etc files somewhere, eg: ~/db/downloadedDb
- Copy the ~/db/downloadedDb/bigDatabase.nal file to your sequenceserver database directory (eg: ~/seqservDbs/bigDatabase.nal)
- Edit the bigDatabase.nal file by making the file paths absolute. This means for example prepending ~/db/downloadedDb/ to all files in the DBLIST line.
For the above example you would end up with the following ~/seqservDbs/bigDatabase.nal file:
#
# Alias file created: Dec 4, 2011 8:31 PM
#
TITLE NCBI Transcript Reference Sequences
DBLIST /Users/moi/db/downloadedDb/refseq_rna.00 /Users/moi/db/downloadedDb/refseq_rna.01
Then it works.
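The path-rewriting step can be scripted. Here is a sketch run in a temporary sandbox; the two directories stand in for the real download directory (eg ~/db/downloadedDb) and SequenceServer's db directory (eg ~/seqservDbs):

```shell
# Sandbox demonstration of making the DBLIST paths absolute;
# directory names and "bigDatabase" are examples only.
DBDIR=$(mktemp -d)   # stands in for the dir holding the .00, .01 ... volumes
SSDIR=$(mktemp -d)   # stands in for the dir SequenceServer scans
cat > "$DBDIR/bigDatabase.nal" <<'EOF'
#
# Alias file created: Dec 4, 2011 8:31 PM
#
TITLE NCBI Transcript Reference Sequences
DBLIST refseq_rna.00 refseq_rna.01
EOF

# Copy the alias file into the SequenceServer dir, prepending the
# absolute volume directory to every name on the DBLIST line:
awk -v d="$DBDIR/" '/^DBLIST/ { for (i = 2; i <= NF; i++) $i = d $i } { print }' \
  "$DBDIR/bigDatabase.nal" > "$SSDIR/bigDatabase.nal"
grep '^DBLIST' "$SSDIR/bigDatabase.nal"
```

All other lines of the alias file pass through unchanged; only the DBLIST entries gain the absolute prefix.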
Sounds more and more like SequenceServer should take care of this.. :)
Sounds more and more like SequenceServer should take care of this.. :)
+1
Added to FAQ (with explanation of how to edit the .nal or .pal file; in progress).
Will the next version of SequenceServer address this? And while we're talking versions, how can I tell what version I have, and what the most recent version is?
@wolfgangrumpf writes:
Will the next version of SequenceServer address this?
It will surely be addressed in an upcoming release. But there are more important issues we want to address first, for example #14, #17, and #43 or #72. Basically, they deal with user interface improvements, and ease of deployment. I am sure you will love these new features, and probably even prefer that they be addressed first :), since the workaround @yannickwurm mentioned takes care of this issue for now.
A direct answer to your question is also complicated by our versioning policy:
And while we're talking versions, how can I tell what version I have, and what the most recent version is?
For now, we follow a rolling release. Simply put, there is no strict versioning; the latest zipball/gem is the one to use. Ideally, you should upgrade as frequently as possible. Here is the rationale:
While SequenceServer has become quite a hit, and we are trying to speed up development accordingly, the current release was intended to be testing/experimental, meant mostly for early adopters. This also explains the lack of polish. We intend to release 1.0-beta soon (in the next 10 days or so), and with some more testing and feedback, make it 1.0 proper. 1.0 will be the first proper release of SequenceServer, and we are really looking forward to it :D.
The version number in the gem release doesn't really mean anything. It is there because RubyGems doesn't allow you to use something like 'Beta' or 'Experimental'.
To completely answer your question: from 1.0 onwards we will follow a strict versioning policy, and you will be able to find out the version number and probably other details as:
# NOTE: this doesn't work now
$ sequenceserver --version
Thanks a lot for your support!
Hello gents,
I've just committed a possible fix for this to my fork here: a5f57e26ed3094363f4d6cf8ef0d4c5568f27bc4
Please test it and let me know if it works for you, the main source change is in 'lib/sequenceserver/helper.rb'. I've tested it with the "nr" database from the link pansapiens gave and it works nicely (my poor old macbook suffered a long and arduous BLAST search...)
In essence, all it does is ignore the db files that match a regular expression. It should then recognise the alias file for the split databases (.pal or .nal files) and use that as a pointer to the database as a whole so all you need to do is make sure the parts are in the same directory as the alias file.
The fix relies on the filename standard NCBI use as described here:
Large databases are formatted in multiple one-gigabyte volumes, which are named using the database.##.tar.gz convention, with ## representing the volume number. All relevant volumes are required to reconstitute the database. An alias file, with .nal or .pal extension, is included in the 00 volume to tie all volumes together. The database can be called using the alias name without the extension. For example, to call the nt database, simply use "-d nt" on the command line, without the quotes.
...so I haven't tested it with renamed or other non-NCBI-standard filenames (just a warning!).
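In other words, the filtering amounts to something like this (a simplified stand-in with a hypothetical method name, not the exact code in helper.rb):

```ruby
# Rough illustration of the approach: drop the per-volume entries that
# `blastdbcmd -list` reports alongside the alias. The method name and
# the exact regexp are simplified stand-ins for the real helper.rb code.
def filter_volumes(db_names)
  db_names.reject { |name| name.match(/\w*\.\d{2}.*/) }
end

filter_volumes(%w[nr nr.00 nr.01 nr.02 swissprot])  # => ["nr", "swissprot"]
```

Only the alias name survives, which is exactly what BLAST needs to treat the volumes as one database.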
Checked out your fork, this seems to work fine with the preformatted NCBI databases. Each database is now only listed once.
Checked out your fork, this seems to work fine with the preformatted NCBI databases. Each database is now only listed once.
Awesome - just committed a better, more accurate expression if you want to test that too ;]
Thanks a lot, dudes. Just a heads up: @yeban won't be available for a bit more than a week to push this into the master branch, if he is happy with it. I have to wonder though whether
line.match(/\w*.\d{2}.*/)
will also throw up some false positives. I think that the regexp can be made tighter - maybe
matches = line.match(/\w*\.\d{2}\.(...)$/) and %w(psq nsq etc etc.).include?(matches[1])
(untested). Do you have the extensions (e.g. nsq) that these files have handy?
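For reference, BLAST volume files use extensions like .phr/.pin/.psq for protein databases and .nhr/.nin/.nsq for nucleotide ones. An untested sketch of the tighter check could look like this (the whitelist below is a guess, not an exhaustive list):

```ruby
# Untested sketch of the tighter filter suggested above; the extension
# whitelist is an assumption (.nhr/.nin/.nsq for nucleotide volumes,
# .phr/.pin/.psq for protein ones) and is not exhaustive.
VOLUME_EXTENSIONS = %w[nhr nin nsq phr pin psq].freeze

def volume_file?(name)
  matches = name.match(/\w*\.\d{2}\.(\w{3})$/)
  !matches.nil? && VOLUME_EXTENSIONS.include?(matches[1])
end

volume_file?('nr.00.psq')  # => true
volume_file?('nr.fasta')   # => false
```

Anchoring on a known extension should cut down false positives from databases whose names merely contain dots and digits.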
The previous commit did improve the accuracy quite a bit. Luckily, if you read the above from NCBI, alias files are essentially pointers used to represent the entire multi-volume database, so you only need to feed BLAST the alias file, which is of course the db name with no extension or volume id. This then reads in the entire db as if it were a single volume. The regexp works by simply ignoring any databases that 'blastdbcmd -list' returns with NCBI-format volume identifiers in them. So for instance it returns nr, nr.00, nr.01 etc., and then, when database objects are actually being created, the regexp filters out all but the alias file "nr" (in this example). See the link to Rubular I posted on the commit for more info and if you want to improve it.
Ah, oops, somehow I missed that second commit, stupidly (and I agree, it is better). I think we should add some logging here to say that it has been ignored, as well.
Tested 8c069e214f1e362e797f119f8d436cf24ca4bbf3 (line.match(/\/\w*[.]\d{2,}[.\w]*/)), and it indeed works better. I also have some databases named nr70.00.p* and nr90.00.p* ... the previous regex would discard these, the new one doesn't. I agree with wwood: logging what's going on would be a good idea to make the magic more transparent.
Good good.
I did initially have some logging for debug so I'll just add that back for my next commit.
Had to re-fork because GitHub messed up my repo (or maybe I did). Cleaner fix is now available here: 92012b737f74d52027d73f3116801d37a7aa7ae0.
This is weird. I tried downloading one of those massive BLAST databases from ftp://ftp.ncbi.nlm.nih.gov/blast/db/ twice (nr.0{0,6}.tar.gz). Each time I get a corrupt file -- the md5 checksums don't match. Any idea?
Do they have different md5 sums?
Try again in a few hours...maybe you caught them in the midst of an update?
@wwood writes:
Do they have different md5 sums?
Yep, all of them. As in, every single part of the nr.0{0,6}.tar.gz multi-part BLAST database.
@wolfgangrumpf writes:
Try again in a few hours...maybe you caught them in the midst of an update?
Umm, maybe. But I tried once yesterday and again today. The net is a little slow on my side, so it takes ages to download anything huge anyway.
Maybe try a local mirror at http://www.bio-mirror.net/ ?
Two mirrors also offer rsync, which should always give you a non-corrupted copy; if you're lucky it might also repair what you already have:
eg, in the US: rsync rsync://bio-mirror.net/biomirror/blast/nr*
or from Australia: rsync rsync://biomirror.aarnet.edu.au/biomirror/biomirror/blast/nr*
If you are really stuck I can start seeding a torrent for you at biotorrents, which should also guarantee a non-corrupt copy.
@pansapiens writes:
Maybe try a local mirror at http://www.bio-mirror.net/ ?
Ok. I will give it a shot. Though I have a very strong feeling that my copy ends up corrupted because of some weirdness in how FTP works (especially through a proxy): my client and the NCBI server are not communicating properly.
Two mirrors also offer rsync which should always give you a non-corrupted copy [...] If you are really stuck I can start seeding a torrent for you at biotorrents, which should also guarantee a non-corrupt copy.
Wow! Thanks a ton for so many options. My college proxy bans rsync and torrent (in fact, everything other than http, https, and ftp). Torrent is further filtered by a shitty (Trend Micro) content filtering software on top. Of course, I have workarounds ;). But still, torrents should be our last option.
yeban writes:
@pansapiens writes:
Maybe try a local mirror at http://www.bio-mirror.net/ ?
Ok. I will give it a shot.
Didn't help. I tried the first two parts of the multi-part BLAST database and ended up with the same corrupt files -- the md5sums of the corrupted files match. :facepalm:
Anyway, I managed to download a non-corrupt copy of the nr.0{0,6}.tar.gz multi-part BLAST database by proxying the download through a different server that I have access to. Phew!
Pushed to next with some changes (SHA: 22689d97).
@xiy: Since I won't let vague comments or commit messages into my repo, I modified your patch a bit. Let me know if it is acceptable to you. I will then push the changes to master and the gem release (revert otherwise :-|).
@pansapiens: Cool with the attribution in the commit message? Since you are the one with different multi-part databases with varying names (hence more test cases), if it breaks in the future (regression or otherwise) we can call upon you for some help :).
May SequenceServer be with you!
No problems with attribution in the commit message. I can continue to test in the future as required.
Large pre-rolled BLAST databases supplied by NCBI are conventionally split into multiple files, but are treated as a single database by BLAST (eg nr.00.*, nr.01.*, nr.02.* etc from ftp://ftp.ncbi.nlm.nih.gov/blast/db/).
In the SequenceServer interface, each FILE that the database is split over is listed as a separate database with the same title, rather than just once as expected. For the current 'nr' database, the interface lists the same title seven times (the title being "All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects").
Running a BLAST search by selecting the first database in the list of duplicates appears to proceed as expected; however, the duplicate entries are likely to confuse users.