Closed jimrybarski closed 4 years ago
Actually I'm wondering if my whole model for those broken spacers is wrong. If there's a repeat following them then the whole approach is wrong. I'd suspect that's actually an assembly artifact if that is what I find, but if so, we should be throwing out those sequences.
Okay so this is the right approach though it reveals some problems with piler-cr.
Here is an example entry from some real data:
Array 4
46 │ >NZ_CP014912.1 Lactobacillus paracollinoides strain TMW 1.1979 chromosome, complete genome
47 │
48 │ Pos Repeat %id Spacer Left flank Repeat Spacer
49 │ ========== ====== ====== ====== ========== ==================================== ======
50 │ 1499936 36 100.0 30 CACTTAGCTG .................................... ACTAAAATGCGGCACAAAATGCCCCACAAC
51 │ 1500002 36 100.0 30 GCCCCACAAC .................................... TTCTGCTTCCCCATTCCTTTCAACGAATCA
52 │ 1500068 36 100.0 30 CAACGAATCA .................................... CAGTTGCTGATGATTCAGCAGCCGAATTAA
53 │ 1500134 36 100.0 30 GCCGAATTAA .................................... AATTTTACAATCGCATGCAGGAACTTGGCT
54 │ 1500200 36 100.0 30 GAACTTGGCT .................................... CACATCATTATTAAATTTATCGCCGTTAAT
55 │ 1500266 36 100.0 30 CGCCGTTAAT .................................... ATCTCGGGGTACAGTTGGATTATCTCGACC
56 │ 1500332 36 100.0 30 TATCTCGACC .................................... ACAGTGGTGCGGTTATCGGCAAGCATAACG
57 │ 1500398 36 100.0 30 AAGCATAACG .................................... CGTCCACATTGACATAAGCTGCGGTGTCCT
58 │ 1500464 36 100.0 GCGGTGTCCT .................................... CGCTTGATACC
59 │ ========== ====== ====== ====== ========== ====================================
60 │ 9 36 30 GTTTTAGATGAATATCAAATCAATGAGGTTCAAAGC
Here is the section of that contig that contains the array. I've split the text at each repeat sequence.
GTTTTAGATGAATATCAAATCAATGAGGTTCAAAGC AAGTCGGCGAGTTGTTGATCCACTTAGCTG
GTTTTAGATGAATATCAAATCAATGAGGTTCAAAGC ACTAAAATGCGGCACAAAATGCCCCACAAC
GTTTTAGATGAATATCAAATCAATGAGGTTCAAAGC TTCTGCTTCCCCATTCCTTTCAACGAATCA
GTTTTAGATGAATATCAAATCAATGAGGTTCAAAGC CAGTTGCTGATGATTCAGCAGCCGAATTAA
GTTTTAGATGAATATCAAATCAATGAGGTTCAAAGC AATTTTACAATCGCATGCAGGAACTTGGCT
GTTTTAGATGAATATCAAATCAATGAGGTTCAAAGC CACATCATTATTAAATTTATCGCCGTTAAT
GTTTTAGATGAATATCAAATCAATGAGGTTCAAAGC ATCTCGGGGTACAGTTGGATTATCTCGACC
GTTTTAGATGAATATCAAATCAATGAGGTTCAAAGC ACAGTGGTGCGGTTATCGGCAAGCATAACG
GTTTTAGATGAATATCAAATCAATGAGGTTCAAAGC CGTCCACATTGACATAAGCTGCGGTGTCCT
GTTTTAGATGAATATCAAATCAATGAGGTTCAAAGC CGCTTGATACCCTGCTCCGTCTCCTGCTCG
GTTTTAGATGAATATCAAATCAATGAGGTTCAAAGC GGTGCCCGCCGGCTCTTCTTGCAAGTGCTA
GTTTTAGATGAATATCAAATCAATGAGGTTCAAAGC CAGCGCCAGTTGAACCTTGTGGTTGAGTTT
GTTTTAGATGAATATCAAATCAATGAGGTTCAAAGC ACAAACACCGTGAAGTGAGTACCGTTCTTT
GTTTTAGATGAATATCAAATCAATGAGGTTCAAAGC ATTATTTTATTCACAAGGTCAAACACGAAC
So the final partial spacer CGCTTGATACC
that piler identified should have been extended to CGCTTGATACCCTGCTCCGTCTCCTGCTCG
as I suspected, but it completely missed the very first and also the final four spacers despite the repeats being perfectly identical. Anyway, I think this should be merged but we may want to try out some other array finders.
This adds code to get all the spacer sequences and coordinates from piler-cr and perform pairwise alignments of those spacers to the parent contig. Any hits are converted to
Feature
objects and added to theOperon
so that they can be easily visualized and subjected to rules.The only controversial thing here (as far as I know) is that the final sequence in each array is usually truncated (for reasons unknown to me), so I extend it to the median length of the rest of the spacers in the array. I believe this is a reasonable step but if piler-cr has a good reason for doing this then we should just ignore such spacers as their sequence is so short that it's useless.