xiaolinfrank / batch_4DTV_calculation

Calculate
4 stars 2 forks source link

batch_4DTV_calculation

Calculate 4DTV (transversion rate on 4-fold degenerated sites) faster with batch axt files

Cite this script (batch_4DTV_calculation.pl):

Jianguo Lu, Peilin Huang, Jialiang Sun, Jian Liu, DupScan: predicting and visualizing vertebrate genome duplication database, Nucleic Acids Research, 2022;, gkac718, https://doi.org/10.1093/nar/gkac718

This script was based on the "calculate_4DTV_correction.pl"

The old script can only calculate 4DTV for a pair of sequences at a time, which are contained in an axt file. In general, there is nothing wrong with this approach. However, in the case of heavy computation, this not only slows down the progress, but also tends to cause the process to crash.

Merge axt files

If you have many axt files:

example1.axt

seq1-seq2
ATGTCTCATATGTCTTCTGTGAACGCGAAAAATCTTCAAAAACTAGCAGATTCAATTGTC
AAACATGTAAAGCACTTTAACAATAATGAAGTTTTGTGTCTGATCAAACTCTTCAATGTG
CTGATGGGAGAGCAGAGCGAGCACAGGGTTGGAAATGGACTGGATCGTGGTAAATTCAGG
AGCATCCTCCACAACACATTTGGAATGACAGATGACATGATTATGGACAGAGTCTTCCGT
GCATTTGACAAGGACAATGATAGCAACGTCAGTGTAAAAGAATGGATAGAAGGACTTTCA
GTGTTTCTGCGAGGGACCTTGgatgaaaaaattaaatATTGTTTTGAGGTTTATGACTTA
AATGGGGATGGATATATTTCACGAGAAGAGATGTTCCACATGCTGAAAAACAGTCTCATA
AAACAACCAACAGAAGAAGATCCAGATGAAGGGATTAAGGACTTGGTAGAGATTACTCTT
AAAAAGATGgaCCACGATCACGACAGCAGACTTTCATACGCTGATTTTGAGAAAGCAGTA
AAAGAAGAAAATCTCTTGCTTGAGGCTTTTGGAGCTTGTCTTCCTGATGCAAAGagtaTT
CTTGCTTTTGAAAGACAGGCCTTCCAG------GATACCACAGAAAAT
atgctgaaaatgTCGGCGATGAACAGAAAATTAATTCAAAACCTCGCCGAGACTTTATGC
AGACAAGTCAAACATTTTAATAAAACAGAGACGGAGTGTCTGATAAGGCTGTTCAACAGT
CTGCTGGGAGAGCAGGCAGAGAGAAAGACGACTATTGGAGTGGACCGGGCCAAATTCAGA
AATATACTGCACCACACTTTCGGGATGACCGACGACATGATGACGGACAGAGTTTGTCGT
GTCATTGACAAGGACAACGATGGCTACTTAAGCGTTAAAGAGTGGGTTGAggctctgtct
gtctttctaagAGGCACACTGGATGAAAAAATGAAATaCTGTTTTGAGGTGTATGACCTG
AACGGGGATGGATACATCTCACGTGAGGAGATGTTTCAGATGCTGAAAGACAGCCTCATC
AGGCAGCCCACCGAAGAGGATCCTGATGAGGGGATTAAGGATATTGTGGAGATTGCCTTG
AAAAAAATGGATTATGACCATGATGGAAGAGTTTCTTATGCTGATTTTGAGAAGACGGTC
ATGGATGAAAACCTTTTACTAGAAGCTTTTGGAAACTGCCTTCCTGATGCAAAGAGTGTA
CTAGCATTTGAGCAACAGGCATTCCAGAAACACGAACACTGCAAAGAA

example2.axt

seq3-seq4
ATGGATCGCCATTCCAAtttaatttccatttggctgcaACTGGAACTGTGTGCCATGGCA
GTACTTCTGGCAAAAGGGGAGATAAGATGCTACTGTGATGCAGCGCATTGTGTGGCAACA
GGTTACATGTGTAAATCCGAGTTAAATGCCTGCTTCACCAGGCTTCTGGACCCACAGAAC
ACAAACTCCCCTCTCACGCATGGCTGCTTGGACCCGACTGCAAACACAGCAGATGTTTGC
CATGCTGGAAGGACAGAGAGCCGCGCTGGGGCCTCGGAGAAGCTTGAGTGCTGTCACGAC
GATATGTGCAATTACAGAGGACTCCATGATGTTGTTTCATATCCCAGGGGGGACAGCTCA
GATCATGGAACAAGATATCAGCCAGACAGTAGCAGGAATCTTCTGACCAGGGTTCAGGAT
TTAACATCCTCTAAAGAGCTGTGGTTCAGAGCAGCCGTGATCGCTGTGCCCATCGCTGGG
GGGCTCATTCTAGTGCTTCTCATCATGCTCGCCTTGCGGATGCTTCGAAGTGAAAACAAA
AGACTGCAGGACCAGAGGCAGCAGATGCTGTCCCGCTTGCACTACAACTTTCATGGA---
CACCACACGAAGAAGGGCCAGGTAGCCAAACTGGATTTGGAATGCATGGTTCCCGTAACC
GGACACGAGAACTGCTGTATGACTTGCGACAAACTGCGACAGTCTGAACTCCACAAT---
---------------GATAAATTGCTGTCTTTAGTTCACTGGGGAATTTACAGCGGTCAC
GGGAAATTGGAATttgta
ATGGATCGC---------CTGGTTTCTCTGTGGTTTCAGCTGGAACTTTGTGCGATGGCT
GTTCTTCTCACGAAAGGAGAGATCAGGTGCTACTGTGACGCACCGCACTGCGTTGCCACC
GGATACATGTGTAAATCAGAGCTCAACGCTTGCTTTACTAAGGTCCTGGACCCTCTTAAC
ACAAACTCACCTTTAACACACGGCTGCGTGGATTCGCTTTTAAACTCTGCAGACGTGTGC
TCTAGTAAAAATGTGGACATTTCAAGTGGAAGCTCCTCTCCTGTGGAGTGCTGCCATGAT
GATATGTGTAACTACAGGGGTTTGCATGAC---CTCACACACCCCAGAGGGGACTCAACA
GAC---------CGATACCACAGC---TCCAATCAGAACCTGATCACAAGGGTGCAAGAG
TTAGCGTCTGCTAAAGAGGTGTGGTTCCGGGCGGCGGTGATAGCGGTTCCCATCGCGGGT
GGGCTTATCCTGGTTCTGCTGATTATGCTGGCGTTGCGAATGCTCCGTAGCGAAAACAAG
CGTCTCCAGGCACAGCGCCAGCAGATGCTTTCTCGCCTGCATTACAGCTTTCACGGACAC
CACCATGCCAAGAAAGGCCACGTGGCTAAGTTGGACTTGGAGTGTATGGTGCCGGTAACG
GGACATGAGAACTGTTGTCTGGGCTGCGATAAGCTGCGGCAGACGGATTTGTGCACTGGA
GGAGGAAGCGGGGGTGAGCGTCTCCTATCTCTGGTACACTGGGGGATGTACACGGGGCAC
GGAAAGCTGGAGTTCGTA

...

To batch calculate 4DTV, simply merge many axt files into one file (AXT file) using a shell script. Note: A sequence does not have a line break in the merged AXT file.

> Merged.AXT
for file in `ls  *.axt`;do
  Ln=$((`sed '/^$/d' axt/$file | wc -l`/2+1))
  if [ $Ln -ne 0 ];then
     sed "$Ln a \%" axt/$file | sed '1 a \%' | tr -d "\n" | tr "%" "\n" >> sample/${n}/$2.AXT &
  fi
done

Merged.AXT

seq1-seq2
ATGTCTCATATGTCTTCTGTGAACGCGAAAAATCTTCAAAAACTAGCAGATTCAATTGTCAAACATGTAAAGCACTTTAACAATAATGAAGTTTTGTGTCTGATCAAACTCTTCAATGTG CTGATGGGAGAGCAGAGCGAGCACAGGGTTGGAAATGGACTGGATCGTGGTAAATTCAGGAGCATCCTCCACAACACATTTGGAATGACAGATGACATGATTATGGACAGAGTCTTCCGT GCATTTGACAAGGACAATGATAGCAACGTCAGTGTAAAAGAATGGATAGAAGGACTTTCAGTGTTTCTGCGAGGGACCTTGgatgaaaaaattaaatATTGTTTTGAGGTTTATGACTTA AATGGGGATGGATATATTTCACGAGAAGAGATGTTCCACATGCTGAAAAACAGTCTCATAAAACAACCAACAGAAGAAGATCCAGATGAAGGGATTAAGGACTTGGTAGAGATTACTCTT AAAAAGATGgaCCACGATCACGACAGCAGACTTTCATACGCTGATTTTGAGAAAGCAGTAAAAGAAGAAAATCTCTTGCTTGAGGCTTTTGGAGCTTGTCTTCCTGATGCAAAGagtaTTCTTGCTTTTGAAAGACAGGCCTTCCAG------GATACCACAGAAAAT
atgctgaaaatgTCGGCGATGAACAGAAAATTAATTCAAAACCTCGCCGAGACTTTATGCAGACAAGTCAAACATTTTAATAAAACAGAGACGGAGTGTCTGATAAGGCTGTTCAACAGT CTGCTGGGAGAGCAGGCAGAGAGAAAGACGACTATTGGAGTGGACCGGGCCAAATTCAGAAATATACTGCACCACACTTTCGGGATGACCGACGACATGATGACGGACAGAGTTTGTCGT GTCATTGACAAGGACAACGATGGCTACTTAAGCGTTAAAGAGTGGGTTGAggctctgtctgtctttctaagAGGCACACTGGATGAAAAAATGAAATaCTGTTTTGAGGTGTATGACCTG AACGGGGATGGATACATCTCACGTGAGGAGATGTTTCAGATGCTGAAAGACAGCCTCATCAGGCAGCCCACCGAAGAGGATCCTGATGAGGGGATTAAGGATATTGTGGAGATTGCCTTG AAAAAAATGGATTATGACCATGATGGAAGAGTTTCTTATGCTGATTTTGAGAAGACGGTCATGGATGAAAACCTTTTACTAGAAGCTTTTGGAAACTGCCTTCCTGATGCAAAGAGTGTACTAGCATTTGAGCAACAGGCATTCCAGAAACACGAACACTGCAAAGAA
seq3-seq4
ATGGATCGCCATTCCAAtttaatttccatttggctgcaACTGGAACTGTGTGCCATGGCAGTACTTCTGGCAAAAGGGGAGATAAGATGCTACTGTGATGCAGCGCATTGTGTGGCAACA GGTTACATGTGTAAATCCGAGTTAAATGCCTGCTTCACCAGGCTTCTGGACCCACAGAACACAAACTCCCCTCTCACGCATGGCTGCTTGGACCCGACTGCAAACACAGCAGATGTTTGC CATGCTGGAAGGACAGAGAGCCGCGCTGGGGCCTCGGAGAAGCTTGAGTGCTGTCACGACGATATGTGCAATTACAGAGGACTCCATGATGTTGTTTCATATCCCAGGGGGGACAGCTCA GATCATGGAACAAGATATCAGCCAGACAGTAGCAGGAATCTTCTGACCAGGGTTCAGGATTTAACATCCTCTAAAGAGCTGTGGTTCAGAGCAGCCGTGATCGCTGTGCCCATCGCTGGG GGGCTCATTCTAGTGCTTCTCATCATGCTCGCCTTGCGGATGCTTCGAAGTGAAAACAAAAGACTGCAGGACCAGAGGCAGCAGATGCTGTCCCGCTTGCACTACAACTTTCATGGA--CACCACACGAAGAAGGGCCAGGTAGCCAAACTGGATTTGGAATGCATGGTTCCCGTAACCGGACACGAGAACTGCTGTATGACTTGCGACAAACTGCGACAGTCTGAACTCCACAAT-----------------GATAAATTGCTGTCTTTAGTTCACTGGGGAATTTACAGCGGTCAC GGGAAATTGGAATttgta
ATGGATCGC---------CTGGTTTCTCTGTGGTTTCAGCTGGAACTTTGTGCGATGGCTGTTCTTCTCACGAAAGGAGAGATCAGGTGCTACTGTGACGCACCGCACTGCGTTGCCACC GGATACATGTGTAAATCAGAGCTCAACGCTTGCTTTACTAAGGTCCTGGACCCTCTTAACACAAACTCACCTTTAACACACGGCTGCGTGGATTCGCTTTTAAACTCTGCAGACGTGTGC TCTAGTAAAAATGTGGACATTTCAAGTGGAAGCTCCTCTCCTGTGGAGTGCTGCCATGATGATATGTGTAACTACAGGGGTTTGCATGAC---CTCACACACCCCAGAGGGGACTCAACAGAC---------CGATACCACAGC---TCCAATCAGAACCTGATCACAAGGGTGCAAGAGTTAGCGTCTGCTAAAGAGGTGTGGTTCCGGGCGGCGGTGATAGCGGTTCCCATCGCGGGTGGGCTTATCCTGGTTCTGCTGATTATGCTGGCGTTGCGAATGCTCCGTAGCGAAAACAAG CGTCTCCAGGCACAGCGCCAGCAGATGCTTTCTCGCCTGCATTACAGCTTTCACGGACACCACCATGCCAAGAAAGGCCACGTGGCTAAGTTGGACTTGGAGTGTATGGTGCCGGTAACG GGACATGAGAACTGTTGTCTGGGCTGCGATAAGCTGCGGCAGACGGATTTGTGCACTGGAGGAGGAAGCGGGGGTGAGCGTCTCCTATCTCTGGTACACTGGGGGATGTACACGGGGCACGGAAAGCTGGAGTTCGTA

Batch 4DTV calculation

batch_4DTV_calculation.pl Merged.AXT > Merged.4DTV

The 4DTV results are in Merged.4DTV