Gene finder should be able to accept blast databases that haven't been custom annotated

wilkelab / Opfi

A Python package for discovery, annotation, and analysis of gene clusters in genomics or metagenomics data sets.

https://opfi.readthedocs.io/

MIT License


Gene finder should be able to accept blast databases that haven't been custom annotated #110

Closed. alexismhill3 closed this issue 4 years ago

alexismhill3 commented 4 years ago

Currently the user must format/label protein references before converting them to the blast database format that Gene Finder expects. Specifically, all headers must be in the format: ; the gene name is parsed out and used by Operon Analyzer to filter candidate systems.

While it is probably reasonable to expect that protein references have, minimally, an accession number and a description (most protein sequence repositories use this format anyway), re-labeling references in a heterogeneous database like NR could be very tedious. Therefore, there should be an option for Gene Finder to use the whole protein description as the gene name; it would then be up to the user to decide how to handle downstream filtering.
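
A minimal sketch of the "use the whole description as the gene name" option, assuming only that each reference header is an accession followed by free-text description. The function names `split_header` and `gene_label` and the example accession are hypothetical illustrations, not part of Opfi.

```python
from typing import Tuple


def split_header(header: str) -> Tuple[str, str]:
    """Split a FASTA-style header into (accession, description).

    Assumes the accession is the first whitespace-delimited token and
    everything after it is the description.
    """
    # Drop a leading '>' if present, then split on the first run of whitespace.
    parts = header.lstrip(">").strip().split(None, 1)
    accession = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return accession, description


def gene_label(header: str) -> str:
    """Use the whole description as the gene name; fall back to the accession."""
    accession, description = split_header(header)
    return description or accession
```

With this approach nothing is parsed out of the description, so any downstream filtering has to match against the full text.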

jimrybarski commented 4 years ago

This is going to be an interesting problem once it gets to the operon analyzer step. If there's some consistent way to identify "improperly"-formatted pipeline results, then we could apply some user-supplied function to attempt to relabel the protein (e.g. run a regex that matches all Cas genes and update the Feature object if a match occurs) and then also make it clear which genes were altered that way so the user will apply extra scrutiny to those results. Thoughts?
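
The relabeling idea could look roughly like the sketch below. The `Hit` class is a hypothetical stand-in (it is not Opfi's Feature object), and the regex is only an illustrative guess at what "matches all Cas genes" might mean; a real pattern would need to be much more careful.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Hit:
    """Hypothetical stand-in for a pipeline feature; not Opfi's Feature class."""
    name: str
    relabeled: bool = False  # True if the name was changed by a relabel function


# Illustrative pattern: "Cas9", "cas12a", "Cas 1", and similar.
CAS_PATTERN = re.compile(r"\bcas\s?(\d+[a-z]?)\b", re.IGNORECASE)


def cas_relabel(description: str) -> Optional[str]:
    """Return a short cas gene name if one is found in the description."""
    match = CAS_PATTERN.search(description)
    return f"cas{match.group(1).lower()}" if match else None


def relabel(hit: Hit, func: Callable[[str], Optional[str]]) -> Hit:
    """Apply a user-supplied function; flag the hit if its name changed."""
    new_name = func(hit.name)
    if new_name is not None and new_name != hit.name:
        return Hit(name=new_name, relabeled=True)
    return hit
```

For example, `relabel(Hit("CRISPR-associated endonuclease Cas9"), cas_relabel)` would return a hit named `cas9` with `relabeled=True`, so altered genes stay easy to pick out for extra scrutiny.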

alexismhill3 commented 4 years ago

Yeah that could be useful; the visualizations are going to look super messy if all features are labeled with the entire protein description. To some extent that is unavoidable, but if we know that many of the genes will be cas genes (for example), it probably does make sense to try relabeling them. I can add an additional field to the output that indicates whether the gene name was parsed out (and if we don't end up needing it I'll just remove it in another update)

clauswilke commented 4 years ago

Seems somewhat complicated. As a first step, in the visualization, why not simply truncate long descriptions to the first n characters, with n = 8 or 10 or so? This will keep all generated figures somewhat legible while requiring very little development work. For interesting cases, one can always go back and manually look at the full description.
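
The truncation suggested here is a one-liner. A sketch, with `truncate_label` as a hypothetical helper name and the default of 10 taken from the range mentioned above:

```python
def truncate_label(label: str, n: int = 10) -> str:
    """Return at most the first n characters of a feature label."""
    return label[:n]
```

Short names such as `cas9` pass through unchanged, and the full description stays available in the underlying results for manual inspection.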