wilkelab / Opfi

A Python package for discovery, annotation, and analysis of gene clusters in genomics or metagenomics data sets.
https://opfi.readthedocs.io/
MIT License

Prevent redundant operons from being loaded #152

Closed: jimrybarski closed this 4 years ago

jimrybarski commented 4 years ago

Two Operon objects are considered redundant if they have the same accession and the same start and end coordinates, and every Feature in one Operon has a corresponding Feature in the other (and vice versa) with the same name, start and end coordinates, and sequence. When parsing gene_finder CSV data, only one Operon from each group of redundant Operons will be loaded.

If the Operons have already been parsed, a user can alternatively just throw them into a set() and the redundant ones will be removed.
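As a rough illustration of the rule above, here is a minimal sketch of how value-based equality and hashing make `set()` drop redundant operons. The `Feature` and `Operon` classes below are simplified stand-ins written for this example, not Opfi's actual classes, and `make_operon` is a hypothetical helper. Storing features in a `frozenset` makes the comparison order-independent and reciprocal; as a simplification, it also collapses identical features within a single operon.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable


# Hypothetical stand-in for a feature: frozen dataclasses get __eq__ and
# __hash__ generated from their fields, so equal values hash equally.
@dataclass(frozen=True)
class Feature:
    name: str
    start: int
    end: int
    sequence: str


# Hypothetical stand-in for an operon. The frozenset of features compares
# equal only if each side contains exactly the same features.
@dataclass(frozen=True)
class Operon:
    accession: str
    start: int
    end: int
    features: FrozenSet[Feature]


def make_operon(accession: str, start: int, end: int,
                features: Iterable[Feature]) -> Operon:
    """Build an Operon from any iterable of features (illustrative helper)."""
    return Operon(accession, start, end, frozenset(features))


cas1 = Feature("cas1", 100, 400, "MKV")
tnsA = Feature("tnsA", 500, 900, "MLR")

a = make_operon("NC_000001", 0, 1000, [cas1, tnsA])
b = make_operon("NC_000001", 0, 1000, [tnsA, cas1])  # same features, other order
c = make_operon("NC_000001", 0, 1000, [cas1])        # one feature missing

unique = {a, b, c}
print(len(unique))  # 2: a and b are redundant, c is distinct
```

Under these assumptions, `a` and `b` collapse into one entry because every field matches, while `c` is kept because its feature set differs.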

Resolves #151

jimrybarski commented 4 years ago

For some context, on our current dataset, looking at all operons with a CRISPR array, any transposase or Tn7 protein, and at least one Cas gene, 2,028,982 of 2,569,068 operons (79%) are non-redundant.

alexismhill3 commented 4 years ago

> For some context, on our current dataset, looking at all operons with a CRISPR array, any transposase or Tn7 protein, and at least one Cas gene, 2,028,982 of 2,569,068 operons (79%) are non-redundant.

Okay, so the redundancy isn't too horrible then. I was a little worried that our real, non-redundant dataset was going to end up being like 25% of what we started with.