wilkelab / Opfi

A Python package for discovery, annotation, and analysis of gene clusters in genomics or metagenomics data sets.
https://opfi.readthedocs.io/
MIT License

Add an option to use the whole reference protein description as the gene label #112

Closed by alexismhill3 4 years ago

alexismhill3 commented 4 years ago

Closes #110

Also adds a bit more logic to description parsing to handle empty headers and headers that don't contain any spaces. This should cover most formatting cases; however, the pipeline still doesn't make any assumptions about what a reasonable reference ID/gene name/gene description actually looks like.
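As a rough illustration of the edge cases mentioned above (empty headers and headers without spaces), here is a minimal sketch. The function name `split_header` and the `placeholder` value are hypothetical and are not taken from Opfi's code; the sketch only shows one defensive way to split a description string.

```python
def split_header(header, placeholder="UNKNOWN"):
    """Split a header into (leading token, remaining description).

    Hypothetical helper for illustration only, not Opfi's implementation.
    """
    header = header.strip()
    if not header:
        # Empty header: nothing to parse, so fall back to a placeholder.
        return placeholder, placeholder
    first, _, rest = header.partition(" ")
    # Header without spaces: reuse the single token as the description.
    return first, rest.strip() or first


print(split_header(""))
print(split_header("cas9"))
print(split_header("cas9 CRISPR-associated endonuclease"))
```

Whether a placeholder or an empty string is the better fallback depends on how downstream code treats missing labels; the sketch makes no claim about which one the pipeline uses.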

Finally, parsing the gene name out of the description is still the default behavior, for backwards compatibility.

clauswilke commented 4 years ago

WP_117689680.1 is a RefSeq accession number, so ref|WP_117689680.1| looks good to me as a general annotation for sequences that can be anything.

alexismhill3 commented 4 years ago

Is parse_descriptions true here? If so, it's doing what it's supposed to; the output just doesn't really make sense for NR.

NR (and presumably other NCBI databases) was created using the -parse_seqids flag. It's still unclear to me exactly what this does, but one side effect is that it strips the sequence identifier (the part with the accession number) out of the header. When I built my cas/tns databases I didn't use -parse_seqids, so to avoid going back and redoing all of my dbs, I'm just going to say that -parse_seqids shouldn't be used for custom annotated databases. (FYI, this isn't documented anywhere yet, so I still need to do that before this gets merged.)

Anyway, if parse_descriptions=False then the pipeline doesn't parse anything; whatever the "sseqid" field is becomes the accession/id, and "stitle" is used for the gene name and gene description.
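The two modes described above can be sketched roughly as follows. This is an illustration under stated assumptions, not Opfi's actual code: the function name `annotate_hit`, the returned dictionary keys, and the assumed "<id> <gene name> <description>" layout for parsed headers are all hypothetical. Only the `parse_descriptions` option and the BLAST fields `sseqid` and `stitle` come from the discussion.

```python
def annotate_hit(sseqid, stitle, parse_descriptions=True):
    """Build a gene annotation from two BLAST tabular fields.

    Hypothetical helper for illustration only.
    """
    if not parse_descriptions:
        # No parsing: sseqid is the id; stitle is both name and description.
        return {"id": sseqid, "name": stitle, "description": stitle}

    # ASSUMPTION: the header looks like "<id> <gene name> <description...>".
    tokens = stitle.split(maxsplit=2)
    # Pad with empty strings so short or empty headers still unpack cleanly.
    tokens += [""] * (3 - len(tokens))
    ref_id, name, description = tokens
    return {"id": ref_id, "name": name, "description": description}


# Made-up stitle values, used only to show the shape of the output.
print(annotate_hit("ref|WP_117689680.1|", "hypothetical protein",
                   parse_descriptions=False))
print(annotate_hit("unused", "REF1 cas9 CRISPR-associated endonuclease"))
```

Under this sketch, an NR-style title that no longer carries the identifier would have its first word treated as the id when parsing is enabled, which is consistent with the odd output discussed above.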