Straightforward hyperlinking to non-standard BLAST hits

wwood commented 13 years ago

Hi,

When a BLAST result is returned with hits, the hit names currently either:

hyperlink to a page '/getsequence/:/:_' if the blast database has been formatted with -parseq_seqids
otherwise plaintext with no hyperlink

If sequence server is to be part of a larger server framework (say the server has a blast server and then a page for each gene with more details than just a fasta file), then it would be good to have a blessed way of substituting the hyperlink with something more custom.

One idea for implementing this is to extract the code for creating the hyperlink and put it into a place where the sequenceserver help documents say "if you want custom hyperlinks, change this part of the code"

The default server's code will be

link_to_fasta = "/get_sequence/:#{id}/:#{string_of_used_databases}" # several dbs... separate by ' '

and then below that have a (by default) commented out example line

#link_to_fasta = "http://randomserver.org/gene/#{id}"

Am I making any sense? I'm happy to code this part if you agree it is a good idea. I can think of at least two different projects I'm working on where this would be of use.

Thanks, ben

yannickwurm commented 13 years ago

That would be great!

On 26 May 2011, at 12:06, wwood wrote:

Hi,

When a BLAST result is returned with hits, the hit names currently either:

hyperlink to a page '/getsequence/:/:_' if the blast database has been formatted with -parseq_seqids

otherwise plaintext with no hyperlink

If sequence server is to be part of a larger server framework (say the server has a blast server and then a page for each gene with more details than just a fasta file), then it would be good to have a blessed way of substituting the hyperlink with something more custom.

One idea for implementing this is to extract the code for creating the hyperlink and put it into a place where the sequenceserver help documents say "if you want custom hyperlinks, change this part of the code"

The default server's code will be

link_to_fasta = "/get_sequence/:#{id}/:#{string_of_used_databases}" # several dbs... separate by ' '

and then below that have a (by default) commented out example line

link_to_fasta = "http://randomserver.org/gene/#{id}"

Am I making any sense? I'm happy to code this part if you agree it is a good idea. I can think of at least two different projects I'm working on where this would be of use.

Thanks, ben

Reply to this email directly or view it on GitHub: https://github.com/yannickwurm/sequenceserver/issues/23

wwood commented 13 years ago

I've written a patch to implement custom hyperlinking at https://github.com/wwood/sequenceserver/commit/67dbbb2346 - let me know what you think. To get a custom link to work, uncomment the method in the new file customisation.rb.

I was thinking though, this code will be mostly replaced if the graphic is included and the default blast output is no longer parsed. So far the method only makes available the sequence_id, and not the database that it hits. Also, there is no custom overall link possible like the "view fasta file with all hits in it" link that happens by default on -parse_seqids formatted blast databases.

Thanks, ben

yannickwurm commented 13 years ago

Thanks Ben,

On 26 May 2011, at 21:40, wwood wrote:

I've written a patch to implement custom hyperlinking at https://github.com/wwood/sequenceserver/commit/67dbbb2346 - let me know what you think. To get a custom link to work, uncomment the method in the new file customisation.rb.

I was thinking though, this code will be mostly replaced if the graphic is included and the default blast output is no longer parsed.

So far the method only makes available the sequence_id, and not the database that it hits. Also, there is no custom overall link possible like the "view fasta file with all hits in it" link that happens by default on -parse_seqids formatted blast databases.

yeah, I don't know how important that "fasta with all hits" link is... We could store the ids in a universally accessible temporary array...

yannick (down with dengue - not very functional!)

Thanks, ben

Reply to this email directly or view it on GitHub: https://github.com/yannickwurm/sequenceserver/issues/23#comment_1242739

wwood commented 13 years ago

Get well soon.. will comment further later.

wwood commented 13 years ago

Thanks for taking the time to review my code.

I have implemented some of Yannick's suggestions in the same branch: https://github.com/wwood/sequenceserver/compare/67dbbb23...df26bbe

By universally accessible temporary array, do you mean an instance (code>@variable</code) variable? From the sinatra docs, it seems that instance variables are request-specific, so would suit our situation? http://www.sinatrarb.com/intro#Request/Instance%20Scope I've done this for code>@all_retreivable_ids</code so that the fasta file of all hits can be generated.

We should put some thought into creating an interface that will remain stable. One particular idea that might impact this stability is support for sequence-similarity algorithms that are not BLAST e.g. exonerate. When non-bioinformatics biologists say that they want to use BLAST, they often mean that they want to do a sequence similarity search. Given that BLAST is not the be-all-and-end-all of sequence similarity, if it is dead easy for people to start to use other programs then they just might do that. Maybe suitable for a plugin-style way of thinking?

Maybe plugins are going a bit far into the future. But I think it is still something we should keep in mind when designing the url generation API (ie this branch). How does this sound for a method definition?

construct_custom_sequence_hyperlink(options={})

options: things gleaned from the algorithm, either from the parameters used for the algorithm (e.g. the databases searched) or things parsed from the output (e.g. each hit). Perhaps some standards:

options[:sequence_id]: The sequences we are generating the URL for.
options[:databases]: the databases/fasta files that were 'searched' by the algorithm e.g. BLAST databases.

That method's returns a url, although it may also change instance variables like @all_retrievable_ids.

By keeping that interface somewhat vague like this, then we are less likely to break things later? Changing the number of arguments later down the line is probably a bad idea because then people's URL generation code will fail with WrongNumberOfArgumentsError or whatever it's called.

While a plugin system would be cool, other issues are more pressing at the moment. I just wanted to try to be thoughtful about the design of that method, and get this branch merged.

Apologies for the length, ben

yannickwurm commented 13 years ago

This looks good to me. Any comments anurag?

cheers yannick

On 29 May 2011, at 13:36, wwood wrote:

Thanks for taking the time to review my code.

I have implemented some of Yannick's suggestions in the same branch: https://github.com/wwood/sequenceserver/compare/67dbbb23...df26bbe

By universally accessible temporary array, do you mean an instance (@variable) variable? From the sinatra docs, it seems that instance variables are request-specific, so would suit our situation? http://www.sinatrarb.com/intro#Request/Instance%20Scope I've done this for code>@all_retreivable_ids</code so that the fasta file of all hits can be generated.

We should put some thought into creating an interface that will remain stable. One particular idea that might impact this stability is support for sequence-similarity algorithms that are not BLAST e.g. exonerate. When non-bioinformatics biologists say that they want to use BLAST, they often mean that they want to do a sequence similarity search. Given that BLAST is not the be-all-and-end-all of sequence similarity, if it is dead easy for people to start to use other programs then they just might do that. Maybe suitable for a plugin-style way of thinking?

Maybe plugins are going a bit far into the future. But I think it is still something we should keep in mind when designing the url generation API (ie this branch). How does this sound for a method definition?

construct_custom_sequence_hyperlink(options={})

options: things gleaned from the algorithm, either from the parameters used for the algorithm (e.g. the databases searched) or things parsed from the output (e.g. each hit). Perhaps some standards:

options[:sequence_id]: The sequences we are generating the URL for.

options[:databases]: the databases/fasta files that were 'searched' by the algorithm e.g. BLAST databases.

That method's returns a url, although it may also change instance variables like @all_retrievable_ids.

By keeping that interface somewhat vague like this, then we are less likely to break things later? Changing the number of arguments later down the line is probably a bad idea because then people's URL generation code will fail with WrongNumberOfArgumentsError or whatever it's called.

While a plugin system would be cool, other issues are more pressing at the moment. I just wanted to try to be thoughtful about the design of that method, and get this branch merged.

Apologies for the length, ben

Reply to this email directly or view it on GitHub: https://github.com/yannickwurm/sequenceserver/issues/23#comment_1256144

wwood commented 13 years ago

Implemented in https://github.com/wwood/sequenceserver/commit/b888534f67a

I also added a second custom method that allows the user to change the whole of the output line, not just the hyperlink. This means that multiple hyperlinks are now possible if by uncommenting and modifying the new method construct_custom_sequence_hyperlinking_line in customisation.rb.

Unless there are any other suggestions I feel this branch is ready to be merged into mainline. Agree?

Thanks, ben

yeban commented 13 years ago

Please could you rebase your branch of master?

yannickwurm commented 13 years ago

Looks great!

On 5 Jun 2011, at 20:35, wwood wrote:

Implemented in https://github.com/wwood/sequenceserver/commit/b888534f67a

I also added a second custom method that allows the user to change the whole of the output line, not just the hyperlink. This means that multiple hyperlinks are now possible if by uncommenting and modifying the new method construct_custom_sequence_hyperlinking_line in customisation.rb.

Unless there are any other suggestions I feel this branch is ready to be merged into mainline. Agree?

Thanks, ben

Reply to this email directly or view it on GitHub: https://github.com/yannickwurm/sequenceserver/issues/23#comment_1304558

wwood commented 13 years ago

I've just rebased this now in the https://github.com/wwood/sequenceserver/commits/custom_hyperlinks after fixing the tests.

I believe I have done it correctly, but this is the first time I've done a rebase so please tell me if I've made any mistakes. Thanks, ben

yeban commented 13 years ago

Thanks Ben, I will have a look and get back to you on this in a while.

yeban commented 13 years ago

Is this code working for you guys? A portion of debug output says:

WARN -- : Unable to parse sequence id `lcl|SI2.2.0_10541 locus=Si_gnF.scaffold06343[1909081..1909999].pep_2 quality=100.00'

What is wrong with that sequence id? Why does it have to confirm to /^\s*psu\|(\S+) /?

I have a feeling that we can make it simpler.

wwood commented 13 years ago

sequence IDs don't have to conform to /^\s*psu|(\S+) / - that is just an example of what can be done with non-standard IDs (the particular example comes from PlasmoDB). Seems that is not obvious enough though. Given you are more new to that portion of the code than me, would you mind writing more sensible comments?

I think we need to write a how-to in the wiki too as well - how about one of you two write that and in exchange I'll write it for some other page on code that you've written?

Do you have any suggestions about how to make it simpler while keeping the same functionality?

yeban commented 13 years ago

It was not obvious to me because I don't understand BLAST's output really well. But to 'admin-biologists' it would be more apparent, no? I am still a bit a apprehensive about people editing code, and writing regexp to make it work. Isn't there a simple way to get id more easily.

With most of Ben's code, it should be possible to derive custom URL's from config.yml too. That would alleviate the need for people to edit Ruby code though.

Something like:

# for simple links
id:  /^\s*psu|(\S+) /
url: "http://apiloc.bio21.unimelb.edu.au/apiloc/gene/%d" #%d gets interpolated with the id

Thoughts?

wwood commented 13 years ago

I like the idea of making it a configuration thing rather than a code thing, but that solution:

Increases complexity if we are to support both methods (for us and the users), since it involves creation of a domain specific language (i.e. the %d bit).
may not be flexible enough. What happens if you want to refer to the databases, or use multiple captures from the regex etc.
involves writing and testing more code. I'd rather not do that for this branch at least. Your solution does cover the most common use case though.

yannickwurm commented 13 years ago

The idea of having something configurable in the yaml is tempting.

But it will be a lot of work to implement something general enough... which is likely to break). Additionally, an admin who wants to add links will need to be able to understand regexps and parsing... basically they need to know how to code. Thus we might as well use our time otherwise and let admins edit Ben's "easy-to-edit" method.

Admins also need to customize the .erb output file (for header & footer & meta tags & title). I think thats fine.

On 10 Jun 2011, at 07:38, wwood wrote:

I like the idea of making it a configuration thing rather than a code thing, but that solution:

Increases complexity if we are to support both methods (for us and the users), since it involves creation of a domain specific language (i.e. the %d bit).

may not be flexible enough. What happens if you want to refer to the databases, or use multiple captures from the regex etc.

involves writing and testing more code. I'd rather not do that for this branch at least. Your solution does cover the most common use case though.

Reply to this email directly or view it on GitHub: https://github.com/yannickwurm/sequenceserver/issues/23#comment_1338977

yeban commented 13 years ago

... since it involves creation of a domain specific language (i.e. the %d bit).

%d is one of Ruby's string interpolation operator.

The simple use case I mentioned is actually quite simple to implement, but I don't have time to look at it now. Besides, Ben's work is good. So for now I will merge this. We can ofcourse modify it later to make it even better, and/or simpler.

yeban commented 13 years ago

So for now I will merge this.

Slightly modified though, https://github.com/yannickwurm/sequenceserver/commits/custom_hyperlinks.

On a related note, lets stick to the convention of using only two spaces throughout. No tabs. And please. Git commit messages - first line describing the change in real short (80 chars), followed by a blank line, followed by whatever details you want to add (optional). Look at (ahem) my commit messages :D.

wwood commented 13 years ago

Sorry about the spaces. My IDE setup is quite horrible. I agree about the spaces. I will change the way I do commit messages.

wurmlab / sequenceserver

Straightforward hyperlinking to non-standard BLAST hits #23

link_to_fasta = "http://randomserver.org/gene/#{id}"