Many operons weren't being visualized because our database contains redundant copies of some files (in one sample, up to four copies of the same file). Since the PNG filenames are based on accession IDs and operon coordinates, we were overwriting the same image file several times.
I manually inspected a few of these files, and they really are identical at the nucleotide level. Removing the redundant files from our database would probably not be worth the effort. However, we can handle this at the operon_analyzer level by excluding Operon objects whose accession IDs, coordinates, and Features are all identical to those of an operon we have already seen. This would also eliminate operons that differ only by silent mutations or by their CRISPR arrays, but I doubt such operons exist, since that would require whatever agency submitted the data to have used the same accession for virtually identical sequencing results.
This solves a few problems: our cluster sizes will be correct, re-BLASTing will go faster, the numbers we report in our paper will be accurate, and in general we'll be handling less data.
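
As a rough illustration of the deduplication described above, here is a minimal sketch. The `Operon` and `Feature` classes below are simplified, hypothetical stand-ins with assumed field names, not operon_analyzer's actual classes, and `operon_key` and `deduplicate` are illustrative helper names. The idea is to build a hashable key from the accession ID, the operon coordinates, and the features, then keep only the first operon seen for each key.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Tuple


# Hypothetical stand-in for a feature; field names are assumptions.
@dataclass(frozen=True, order=True)
class Feature:
    name: str
    start: int
    end: int
    strand: int


# Hypothetical stand-in for an operon; field names are assumptions.
@dataclass
class Operon:
    accession: str
    start: int
    end: int
    features: List[Feature]


def operon_key(operon: Operon) -> Tuple:
    """Build a hashable identity from accession, coordinates, and features."""
    # Sort the features so that their order within the operon does not matter.
    return (
        operon.accession,
        operon.start,
        operon.end,
        tuple(sorted(operon.features)),
    )


def deduplicate(operons: Iterable[Operon]) -> Iterator[Operon]:
    """Yield only the first operon seen for each distinct key."""
    seen = set()
    for operon in operons:
        key = operon_key(operon)
        if key in seen:
            continue  # redundant copy: same accession, coordinates, features
        seen.add(key)
        yield operon
```

Because only the first copy for each key is kept, every surviving operon maps to a distinct accession-and-coordinates filename only if no two operons share those values while differing in their features; such operons would still be treated as distinct here.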