wilkelab / Opfi

A Python package for discovery, annotation, and analysis of gene clusters in genomics or metagenomics data sets.
https://opfi.readthedocs.io/
MIT License

Deduplication of operons #163

Closed jimrybarski closed 4 years ago

jimrybarski commented 4 years ago

Adds a function to remove operons that are roughly identical within the bounds of the identified features. This can produce false positives, that is, operons that are actually unique but are flagged as redundant. However, the method is quite fast, and such false positives probably differ by only a few nucleotides.

The integration test CSV contains three nucleotide sequences: the unaltered sequence, its reverse complement, and the forward sequence with a single nucleotide deleted.
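For reference, here is a minimal sketch of how three such sequences could be derived from one input. The helper names and the example sequence are made up for illustration and are not taken from Opfi or from the actual test CSV.

```python
# Hypothetical helpers; not part of Opfi.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def delete_nucleotide(seq: str, index: int) -> str:
    """Return seq with the single nucleotide at `index` removed."""
    return seq[:index] + seq[index + 1:]


original = "ATGGCGTACGTTAGC"  # made-up sequence, for illustration only
variants = {
    "unaltered": original,
    "reverse_complement": reverse_complement(original),
    "single_deletion": delete_nucleotide(original, 5),
}
```

All three variants describe essentially the same locus, so a deduplication step would be expected to keep only one of them.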

Note that this mostly ignores CRISPR arrays. It makes sure they're in the same order in the overall motif, but doesn't verify their exact positions or sequences. This is because pilercr is so context-sensitive that it gives different results even on an exact reverse complement.
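To illustrate the idea (not the code in this PR), here is a rough sketch of order-based deduplication. It assumes each operon is a list of features with a name and coordinates, that arrays are labelled `"CRISPR array"`, and that a length tolerance of a few nucleotides is acceptable. The `Feature` class, `ARRAY` label, and `tolerance` parameter are all hypothetical. This version does a simple pairwise comparison, so it is slower than what the PR describes.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

ARRAY = "CRISPR array"  # assumed label for array features


@dataclass(frozen=True)
class Feature:
    name: str
    start: int
    end: int


Summary = List[Tuple[str, Optional[int]]]


def summarize(features: List[Feature]) -> Summary:
    """List (name, length) in positional order; arrays keep only their place."""
    ordered = sorted(features, key=lambda f: min(f.start, f.end))
    return [
        (f.name, None if f.name == ARRAY else abs(f.end - f.start))
        for f in ordered
    ]


def _match(a: Summary, b: Summary, tolerance: int) -> bool:
    if len(a) != len(b):
        return False
    for (name_a, len_a), (name_b, len_b) in zip(a, b):
        if name_a != name_b:
            return False
        # Arrays have length None: only their position in the order is checked.
        if len_a is not None and abs(len_a - len_b) > tolerance:
            return False
    return True


def similar(a: Summary, b: Summary, tolerance: int) -> bool:
    """True if the summaries match in either orientation."""
    return _match(a, b, tolerance) or _match(a, b[::-1], tolerance)


def deduplicate(operons: List[List[Feature]], tolerance: int = 3) -> List[List[Feature]]:
    """Greedily keep the first operon of each group of near-identical ones."""
    kept: List[List[Feature]] = []
    kept_summaries: List[Summary] = []
    for operon in operons:
        summary = summarize(operon)
        if not any(similar(summary, k, tolerance) for k in kept_summaries):
            kept.append(operon)
            kept_summaries.append(summary)
    return kept


# Made-up coordinates mimicking the three test cases, plus one distinct operon.
forward = [Feature("cas9", 0, 300), Feature("cas1", 350, 650), Feature(ARRAY, 700, 900)]
revcomp = [Feature("cas9", 1000, 700), Feature("cas1", 650, 350), Feature(ARRAY, 295, 110)]
deleted = [Feature("cas9", 0, 299), Feature("cas1", 349, 649), Feature(ARRAY, 699, 899)]
distinct = [Feature("cas9", 0, 300), Feature("cas2", 350, 650), Feature(ARRAY, 700, 900)]

unique = deduplicate([forward, revcomp, deleted, distinct])
```

Here the reverse-complement operon has a slightly different array position, as pilercr might report, but it is still treated as a duplicate because only the array's place in the order is compared. Tolerance-based matching is not transitive, so the result of this greedy pass can depend on input order.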