wilkelab / Opfi

A Python package for discovery, annotation, and analysis of gene clusters in genomics or metagenomics data sets.
https://opfi.readthedocs.io/
MIT License

Eliminate redundant operons #151

Closed jimrybarski closed 4 years ago

jimrybarski commented 4 years ago

Many operons weren't being visualized because our database contains redundant copies of some files (in one sample, up to four copies of the same file). Since the PNG filenames are based on accession IDs and operon coordinates, we overwrite the same image file several times.

I manually inspected a few of these files, and they really are identical at the nucleotide level. Removing redundant files from our database would probably not be worth it. However, we can handle this at the operon_analyzer level by excluding Operon objects if their accession IDs, coordinates, and Features are all identical. This would also eliminate operons that differ only by silent mutations or by their CRISPR arrays, but I doubt such operons exist, since it would require whichever agency submitted the data to have used duplicate accessions for virtually identical sequencing results.
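A minimal sketch of that exclusion rule, for illustration only. The `Feature` and `Operon` classes below are hypothetical stand-ins with assumed field names, not operon_analyzer's actual classes, and `operon_key` and `deduplicate` are made-up helper names. The idea is to build a hashable key from the accession ID, the operon coordinates, and the features, and keep only the first operon seen for each key.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Tuple


@dataclass(frozen=True)
class Feature:
    # Hypothetical stand-in for a feature record; field names are assumptions.
    name: str
    start: int
    end: int
    strand: int


@dataclass(frozen=True)
class Operon:
    # Hypothetical stand-in for an operon; not operon_analyzer's real interface.
    accession: str
    start: int
    end: int
    features: Tuple[Feature, ...]


def operon_key(operon: Operon) -> tuple:
    # Sort the features so that two operons listing the same features
    # in a different order still produce the same key.
    features = tuple(
        sorted(operon.features, key=lambda f: (f.start, f.end, f.name, f.strand))
    )
    return (operon.accession, operon.start, operon.end, features)


def deduplicate(operons: Iterable[Operon]) -> Iterator[Operon]:
    # Yield each operon the first time its key is seen; skip later copies.
    seen = set()
    for operon in operons:
        key = operon_key(operon)
        if key in seen:
            continue
        seen.add(key)
        yield operon


if __name__ == "__main__":
    cas9 = Feature("cas9", 100, 4200, 1)
    cas1 = Feature("cas1", 4300, 5200, 1)
    operons = [
        Operon("ACC0001", 0, 6000, (cas9, cas1)),
        Operon("ACC0001", 0, 6000, (cas1, cas9)),  # redundant copy, features reordered
        Operon("ACC0002", 0, 6000, (cas9, cas1)),  # different accession, so it is kept
    ]
    for operon in deduplicate(operons):
        print(operon.accession, operon.start, operon.end)
```

Because the key ignores nucleotide sequence, this sketch has the same limitation noted above: operons that differ only by silent mutations would be treated as duplicates.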

This solves a few problems: our cluster sizes will be correct, re-BLASTing will go faster, the numbers we report in our paper will be accurate, and in general we'll be handling less data.