wurmlab / sequenceserver

Intuitive graphical web interface for running BLAST bioinformatics tool (i.e. have your own custom NCBI BLAST site!)
https://sequenceserver.com
GNU Affero General Public License v3.0
268 stars, 112 forks

Improve FASTA Download #485

Status: Open. enuggetry opened this issue 3 years ago

enuggetry commented 3 years ago

Currently, a FASTA download can take a long time to process before the download itself begins. This is understandable given that blastdbcmd is used to extract the FASTA data. However, the app hangs during that processing. I'd like it to display a progress indicator while waiting, as processing can take several minutes. Alternatively, it could do the processing in the background. (Not a high priority, just a suggestion.)

yeban commented 3 years ago

I agree the UI shouldn't hang while waiting for the download and should probably show a progress indicator as the download can indeed take a while. I would also like any error during download to be displayed in the standard error modal instead of navigating to a new page.

I think this is the kind of enhancement that can be included in a point release after 2.0 stable.

lukaszsobala commented 1 month ago

Hello,

I don't know exactly how sequenceserver uses BLAST commands to download the full FASTA files, but this seems possible to improve. From what I understand, it currently generates a single command line containing all the requested sequence IDs, and it sometimes fails if the command is too long (correct me if I am wrong). I could not determine what the limit is in the current version (sequenceserver 3.1.2); I tried up to 3500 hits.

What do you think about changing this behaviour in favour of using blastdbcmd (which supposedly can be used with multiple databases) and -entry_batch with a list of identifiers to download? Right now (at least for me) it is an extra step to download all the results if there are too many to download from the sequenceserver web gui.
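To make the suggestion concrete, here is a minimal Ruby sketch of the `-entry_batch` approach. It is not SequenceServer's actual code: the helper name `entry_batch_command`, the identifiers, and the database paths are made up for illustration, and passing several databases as one space-separated `-db` value is an assumption. The point is that the identifiers travel in a file, so the command line stays short no matter how many hits there are.

```ruby
require 'tempfile'

# Hypothetical helper (not part of SequenceServer): writes one identifier per
# line to a temporary file and returns the blastdbcmd argument vector together
# with the Tempfile, so the caller controls when the file is removed.
def entry_batch_command(ids, databases)
  batch = Tempfile.new(['ids', '.txt'])
  ids.each { |id| batch.puts(id) }
  batch.flush
  # Assumption: multiple databases are given as one space-separated value.
  cmd = ['blastdbcmd', '-db', databases.join(' '), '-entry_batch', batch.path]
  [cmd, batch]
end

# Illustrative identifiers and database paths only.
cmd, batch = entry_batch_command(%w[seq1 seq2 seq3], %w[/db/a /db/b])
puts cmd.join(' ')
# To actually run it (requires BLAST+ to be installed):
#   require 'open3'
#   fasta, status = Open3.capture2(*cmd)
batch.close! # close and delete the temporary file
```

Because the command is built as an argument array rather than a shell string, identifiers containing characters such as `|` would not need shell quoting.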

Or one could run blastdbcmd multiple times on batches of identifiers, either sequentially or in parallel (akin to GNU parallel, using as many threads as the user defined), as a fallback if the number of sequences is too large for a single download, and then combine the output.
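A rough Ruby sketch of that fallback, under stated assumptions: `fetch_in_batches` is a hypothetical name, the block stands in for whatever would run blastdbcmd on one batch, and for brevity it starts one thread per batch instead of capping concurrency at a user-defined thread count.

```ruby
# Hypothetical sketch: split the identifiers into batches, fetch each batch in
# its own thread, and concatenate the results in the original batch order.
# The block is a stand-in for whatever runs blastdbcmd on a single batch.
def fetch_in_batches(ids, batch_size: 1000, &fetch)
  threads = ids.each_slice(batch_size).map do |batch|
    Thread.new { fetch.call(batch) }
  end
  # Thread#value waits for the thread to finish and returns the block's result.
  threads.map(&:value).join
end

# Fake fetcher that fabricates FASTA records, so the sketch runs without BLAST+.
fake_fetch = ->(batch) { batch.map { |id| ">#{id}\nACGT\n" }.join }
fasta = fetch_in_batches(%w[s1 s2 s3 s4 s5], batch_size: 2, &fake_fetch)
puts fasta
```

Collecting the results through the ordered list of threads keeps the combined output in batch order even if the batches finish at different times.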

If this is impossible to introduce, one thing that could be improved is doing something like sort|uniq on the FASTA sequence identifiers. This way, in case of many query sequences, only the non-redundant hits would be downloaded. The changed order of hits would not matter when downloading all of them in such a case, because sequenceserver returns a single multi-FASTA file anyway.

yannickwurm commented 1 month ago

Hey @lukaszsobala, we should actually have resolved this last week: https://github.com/wurmlab/sequenceserver/pull/754 The fix should be in 3.1.1 and 3.1.2, and I think it addresses your issue?

Are you able to test with the newer version?

Cheers Yannick

lukaszsobala commented 1 month ago

@yannickwurm

Oh, thank you! It usually failed above roughly 2500 hits, and I was surprised that it handled more than 3500 this time (indeed, I am using version 3.1.2 now). I assumed this was still unsolved because I found this issue still open.

What do you think about sort|uniqing the results of multi-query blast?

Cheers, Łukasz

yannickwurm commented 1 month ago

Hey @joko3ono, @lukaszsobala has a good point: some of the identifiers will be redundant. Could you please add a redundancy removal step? (Probably better on the Ruby side than in the Unix command line.)

Cheers! Yannick

PS: Łukasz, you're right, we should have closed this issue!
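For illustration only, a Ruby-side redundancy removal step could be as small as the sketch below. `unique_hit_ids` and the sample identifiers are hypothetical, not taken from the SequenceServer codebase. Unlike `sort | uniq`, `Array#uniq` keeps the first occurrence of each identifier in its original position, so the hit order is preserved.

```ruby
# Hypothetical helper: normalise whitespace, drop blank entries, and remove
# duplicate identifiers while keeping first-occurrence order.
def unique_hit_ids(ids)
  ids.map(&:strip).reject(&:empty?).uniq
end

# Illustrative identifiers, as might be gathered across several queries.
hit_ids = %w[seqA seqB seqA seqC seqB]
p unique_hit_ids(hit_ids)
```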