wilkelab / Opfi

A Python package for discovery, annotation, and analysis of gene clusters in genomics or metagenomics data sets.
https://opfi.readthedocs.io/
MIT License

CRISPR arrays are being miscounted #59

Closed (jimrybarski closed 4 years ago)

jimrybarski commented 4 years ago

This is definitely caused by the repeat count being part of the feature's name, so matching against the literal string "CRISPR array" never succeeds.

jimrybarski commented 4 years ago

The issue is that RuleSet.require() only does exact matches. I think the most elegant solution is to add an optional startswith flag that checks only whether a Feature's name starts with a particular string. This would solve the issue with CRISPR arrays, and more generally it would let users check for a family of genes whose names all start with the same string and vary by a number. It would benefit us in particular by letting us require any Cas variant (e.g. require('cas12', startswith=True) would check for Cas12a, Cas12b, etc.).
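To make the proposal concrete, here is a minimal standalone sketch of exact versus prefix matching. It does not use Opfi itself: the helper name_matches and the example name "CRISPR array (5 repeats)" are assumptions made for illustration, and the real name format and RuleSet internals may differ.

```python
def name_matches(feature_name: str, target: str, startswith: bool = False) -> bool:
    """Hypothetical helper (not part of Opfi): exact match by default,
    prefix match when startswith=True. Matching is case-sensitive."""
    if startswith:
        return feature_name.startswith(target)
    return feature_name == target


# Assumed example names; the repeat-count suffix format is illustrative only.
names = ["cas12a", "cas12b", "cas9", "CRISPR array (5 repeats)"]

# Exact matching misses the array because of the suffix in its name.
exact_hits = [n for n in names if name_matches(n, "CRISPR array")]

# Prefix matching finds the array and also a whole gene family.
array_hits = [n for n in names if name_matches(n, "CRISPR array", startswith=True)]
cas12_hits = [n for n in names if name_matches(n, "cas12", startswith=True)]
```

Note that a plain prefix check on 'cas12' would also accept a name like "cas123", which is one limitation of this approach.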

clauswilke commented 4 years ago

Why not write a rule based on regular expression matches? The optional argument could be regex, which would be False by default.

I.e.:

require('cas12') # only matches exactly "cas12"
require(r'^cas12[a-z]*$', regex = True) # matches "cas12", "cas12a", etc., but not "cas123" 
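
The behavior of that pattern can be checked with Python's standard re module, independently of Opfi. This is only a sketch: the candidate names are made up, and how a regex option would be wired into require() is not shown here.

```python
import re

# Same pattern as in the example above.
pattern = re.compile(r'^cas12[a-z]*$')

# Made-up feature names to test against.
candidates = ["cas12", "cas12a", "cas12b", "cas123", "cas9"]

# Keep only names that the pattern accepts in full.
matched = [name for name in candidates if pattern.search(name)]
print(matched)  # ['cas12', 'cas12a', 'cas12b']
```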

jimrybarski commented 4 years ago

I like that idea a lot. It occurs to me, though, that this is not the right way to solve the problem at hand. The issue with the CRISPR arrays is entirely about how the name is presented during visualization, so the special handling of the CRISPR array name should be performed in the visualization module and not during the parsing of the pipeline data. Otherwise we'll have to go out of our way to explain that this one feature can't be used with this one particular rule, which feels janky to me.
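
A rough sketch of that separation, assuming the repeat count can be stored apart from the name. The Feature class below is a hypothetical stand-in rather than Opfi's own class, and the repeat_count attribute and label format are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Feature:
    """Hypothetical stand-in for a parsed feature (not Opfi's Feature class)."""
    name: str
    repeat_count: Optional[int] = None  # assumed attribute, set only for arrays


def display_label(feature: Feature) -> str:
    """Build the label shown in plots; the stored name is left untouched."""
    if feature.name == "CRISPR array" and feature.repeat_count is not None:
        return f"{feature.name} ({feature.repeat_count} repeats)"
    return feature.name


array = Feature(name="CRISPR array", repeat_count=5)
gene = Feature(name="cas12a")

# Rules can still match the plain name exactly...
is_array = array.name == "CRISPR array"
# ...while the visualization layer decorates it with the count.
array_label = display_label(array)
gene_label = display_label(gene)
```

With the name kept plain in the parsed data, an exact-match rule for "CRISPR array" works without any special case.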