Closed alexismhill3 closed 4 years ago
WP_117689680.1 is a RefSeq accession number, so ref|WP_117689680.1| looks good to me as a general annotation for sequences that can be anything.
Is parse_descriptions true here? In that case it's doing what it's supposed to, the output just doesn't really make sense for NR.
NR (and presumably the other NCBI databases) was created using the -parse_seqids flag. It's still unclear to me exactly what this does, but one side effect is that it strips the sequence identifier (the part with the accession number) out of the header. When I built my cas/tns databases, though, I didn't use -parse_seqids. So, rather than go back and redo all of my dbs, I'm just going to say that -parse_seqids shouldn't be used for custom annotated databases (FYI this isn't documented anywhere yet, so I still need to do that before this gets merged).
Anyway, if parse_descriptions=False then the pipeline doesn't parse anything; whatever the "sseqid" field is becomes the accession/id, and "stitle" is used for the gene name and gene description.
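To make the no-parsing case concrete, here is a rough sketch of that mapping. The function name, the dict-based hit, and the output keys are made up for illustration and are not the pipeline's actual code; only the "sseqid" and "stitle" field names come from the BLAST tabular output.

```python
def annotation_without_parsing(hit):
    """Hypothetical helper: build annotation fields from a BLAST hit as-is.

    `hit` is assumed to be a dict keyed by BLAST tabular field names.
    Nothing is split or interpreted; the raw values are passed through.
    """
    title = hit.get("stitle", "")
    return {
        "accession": hit.get("sseqid", ""),  # used verbatim as the accession/id
        "name": title,                       # the full title doubles as the gene name...
        "description": title,                # ...and as the gene description
    }


# Made-up hit, roughly what an NR search might report
hit = {"sseqid": "ref|WP_117689680.1|", "stitle": "hypothetical protein"}
annotation = annotation_without_parsing(hit)
```

With an NR-style hit like the one above, the accession stays ref|WP_117689680.1| and the title is reused unchanged for both name and description.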
Closes #110
Also adds a bit more logic to description parsing to handle empty headers and headers that don't contain any spaces. This should cover most formatting cases; however, the pipeline still doesn't make any assumptions about what a reasonable reference ID/gene name/gene description actually looks like.
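For anyone skimming, the edge cases look roughly like the sketch below. It assumes a custom database header is laid out as "id name description" separated by whitespace, which is my shorthand here and not something the pipeline enforces; the function name is also hypothetical and the real implementation may differ.

```python
from typing import Tuple


def split_header(stitle: str) -> Tuple[str, str, str]:
    """Hypothetical sketch: split a header into (ref_id, name, description).

    Assumes whitespace-separated "id name description". Missing pieces
    come back as empty strings instead of raising an error.
    """
    # Split into at most three pieces; the remainder stays in the description
    tokens = stitle.strip().split(maxsplit=2)
    # Pad so that empty headers and headers without spaces still unpack cleanly
    tokens += [""] * (3 - len(tokens))
    ref_id, name, description = tokens
    return ref_id, name, description


empty = split_header("")
no_spaces = split_header("WP_117689680.1")
full = split_header("WP_117689680.1 cas9 CRISPR-associated endonuclease Cas9")
```

Note that nothing here checks whether the first token is actually a sensible reference ID, which is the same limitation described above.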
Finally, parsing out the gene name from the description is still the default behavior, for backwards compatibility reasons.