wilkelab / Opfi

A Python package for discovery, annotation, and analysis of gene clusters in genomics or metagenomics data sets.
https://opfi.readthedocs.io/
MIT License
21 stars 5 forks source link

Self targeting spacer search #164

Closed jimrybarski closed 3 years ago

jimrybarski commented 3 years ago

This adds code to get all the spacer sequences and coordinates from piler-cr and perform pairwise alignments of those spacers to the parent contig. Any hits are converted to Feature objects and added to the Operon so that they can be easily visualized and subjected to rules.

The only controversial thing here (as far as I know) is that the final sequence in each array is usually truncated (for reasons unknown to me), so I extend it to the median length of the rest of the spacers in the array. I believe this is a reasonable step but if piler-cr has a good reason for doing this then we should just ignore such spacers as their sequence is so short that it's useless.

jimrybarski commented 3 years ago

Actually I'm wondering if my whole model for those broken spacers is wrong. If there's a repeat following them then the whole approach is wrong. I'd suspect that's actually an assembly artifact if that is what I find, but if so, we should be throwing out those sequences.

jimrybarski commented 3 years ago

Okay so this is the right approach though it reveals some problems with piler-cr.

Here is an example entry from some real data:

 Array 4
  46   │ >NZ_CP014912.1 Lactobacillus paracollinoides strain TMW 1.1979 chromosome, complete genome
  47   │
  48   │        Pos  Repeat     %id  Spacer  Left flank    Repeat                                  Spacer
  49   │ ==========  ======  ======  ======  ==========    ====================================    ======
  50   │    1499936      36   100.0      30  CACTTAGCTG    ....................................    ACTAAAATGCGGCACAAAATGCCCCACAAC
  51   │    1500002      36   100.0      30  GCCCCACAAC    ....................................    TTCTGCTTCCCCATTCCTTTCAACGAATCA
  52   │    1500068      36   100.0      30  CAACGAATCA    ....................................    CAGTTGCTGATGATTCAGCAGCCGAATTAA
  53   │    1500134      36   100.0      30  GCCGAATTAA    ....................................    AATTTTACAATCGCATGCAGGAACTTGGCT
  54   │    1500200      36   100.0      30  GAACTTGGCT    ....................................    CACATCATTATTAAATTTATCGCCGTTAAT
  55   │    1500266      36   100.0      30  CGCCGTTAAT    ....................................    ATCTCGGGGTACAGTTGGATTATCTCGACC
  56   │    1500332      36   100.0      30  TATCTCGACC    ....................................    ACAGTGGTGCGGTTATCGGCAAGCATAACG
  57   │    1500398      36   100.0      30  AAGCATAACG    ....................................    CGTCCACATTGACATAAGCTGCGGTGTCCT
  58   │    1500464      36   100.0          GCGGTGTCCT    ....................................    CGCTTGATACC
  59   │ ==========  ======  ======  ======  ==========    ====================================
  60   │          9      36              30                GTTTTAGATGAATATCAAATCAATGAGGTTCAAAGC

Here is the section of that contig that contains the array. I've split the text at each repeat sequence.

GTTTTAGATGAATATCAAATCAATGAGGTTCAAAGC  AAGTCGGCGAGTTGTTGATCCACTTAGCTG
GTTTTAGATGAATATCAAATCAATGAGGTTCAAAGC  ACTAAAATGCGGCACAAAATGCCCCACAAC
GTTTTAGATGAATATCAAATCAATGAGGTTCAAAGC  TTCTGCTTCCCCATTCCTTTCAACGAATCA
GTTTTAGATGAATATCAAATCAATGAGGTTCAAAGC  CAGTTGCTGATGATTCAGCAGCCGAATTAA
GTTTTAGATGAATATCAAATCAATGAGGTTCAAAGC  AATTTTACAATCGCATGCAGGAACTTGGCT
GTTTTAGATGAATATCAAATCAATGAGGTTCAAAGC  CACATCATTATTAAATTTATCGCCGTTAAT
GTTTTAGATGAATATCAAATCAATGAGGTTCAAAGC  ATCTCGGGGTACAGTTGGATTATCTCGACC
GTTTTAGATGAATATCAAATCAATGAGGTTCAAAGC  ACAGTGGTGCGGTTATCGGCAAGCATAACG
GTTTTAGATGAATATCAAATCAATGAGGTTCAAAGC  CGTCCACATTGACATAAGCTGCGGTGTCCT
GTTTTAGATGAATATCAAATCAATGAGGTTCAAAGC  CGCTTGATACCCTGCTCCGTCTCCTGCTCG
GTTTTAGATGAATATCAAATCAATGAGGTTCAAAGC  GGTGCCCGCCGGCTCTTCTTGCAAGTGCTA
GTTTTAGATGAATATCAAATCAATGAGGTTCAAAGC  CAGCGCCAGTTGAACCTTGTGGTTGAGTTT
GTTTTAGATGAATATCAAATCAATGAGGTTCAAAGC  ACAAACACCGTGAAGTGAGTACCGTTCTTT
GTTTTAGATGAATATCAAATCAATGAGGTTCAAAGC  ATTATTTTATTCACAAGGTCAAACACGAAC

So the final partial spacer CGCTTGATACC that piler identified should have been extended to CGCTTGATACCCTGCTCCGTCTCCTGCTCG as I suspected, but it completely missed the very first and also the final four spacers despite the repeats being perfectly identical. Anyway, I think this should be merged but we may want to try out some other array finders.