Closed jimrybarski closed 4 years ago
For some context, on our current dataset, looking at all operons with a CRISPR array, any transposase or Tn7 protein, and at least one Cas gene, 2,028,982 of 2,569,068 operons (79%) are non-redundant.
Okay, so the redundancy isn't too horrible then. I was a little worried that our real, non-redundant dataset was going to end up being like 25% of what we started with.
If two `Operon` objects have the same accession and start and end coordinates, and each `Feature` object in one `Operon` has a corresponding `Feature` in the other `Operon` (and this must be true reciprocally) with the same name, start and end coordinates, and sequence, then only one of the two `Operon`s will be loaded when parsing gene_finder CSV data.

If the `Operon`s have already been parsed, a user can alternatively just throw them into a `set()` and the redundant ones will be removed.

Resolves #151
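
For anyone reading along, here is a rough sketch of how the equality rule described above lets `set()` drop redundant operons. The `Feature` and `Operon` classes below are simplified, hypothetical stand-ins written for this illustration, not the actual classes in this repository; the field names and constructor are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    # Hypothetical stand-in: a frozen dataclass is hashable and compares by field.
    name: str
    start: int
    end: int
    sequence: str


class Operon:
    # Hypothetical stand-in, not the real class from this project.
    def __init__(self, accession, start, end, features):
        self.accession = accession
        self.start = start
        self.end = end
        self.features = list(features)

    def _key(self):
        # A frozenset ignores feature order, and comparing two frozensets
        # checks the correspondence in both directions at once.
        return (self.accession, self.start, self.end, frozenset(self.features))

    def __eq__(self, other):
        if not isinstance(other, Operon):
            return NotImplemented
        return self._key() == other._key()

    def __hash__(self):
        # Must agree with __eq__ so that sets and dicts deduplicate correctly.
        return hash(self._key())


cas9 = Feature("cas9", 100, 400, "MKV")
tnsa = Feature("tnsA", 450, 700, "MAR")

a = Operon("ACC0001", 0, 1000, [cas9, tnsa])
b = Operon("ACC0001", 0, 1000, [tnsa, cas9])  # same features, different order
c = Operon("ACC0001", 0, 1000, [cas9, Feature("tnsA", 450, 700, "MAQ")])  # sequence differs

unique = set([a, b, c])
print(len(unique))
```

In this sketch `a` and `b` collapse into one entry while `c` is kept, because one of its features has a different sequence. Which of the two equal operons survives in the set is not something to rely on.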