Calculate 4DTV (transversion rate at 4-fold degenerate sites) faster with batched axt files
Cite this script (batch_4DTV_calculation.pl):
Jianguo Lu, Peilin Huang, Jialiang Sun, Jian Liu, DupScan: predicting and visualizing vertebrate genome duplication database, Nucleic Acids Research, 2022, gkac718, https://doi.org/10.1093/nar/gkac718
This script is based on "calculate_4DTV_correction.pl".
The old script calculates 4DTV for only one pair of sequences at a time, read from a single axt file. There is nothing wrong with this approach in general. With a heavy workload, however, running the script once per file slows things down and also tends to make the process crash.
Suppose you have many axt files, each holding one pair of aligned sequences:
example1.axt
seq1-seq2
ATGTCTCATATGTCTTCTGTGAACGCGAAAAATCTTCAAAAACTAGCAGATTCAATTGTC
AAACATGTAAAGCACTTTAACAATAATGAAGTTTTGTGTCTGATCAAACTCTTCAATGTG
CTGATGGGAGAGCAGAGCGAGCACAGGGTTGGAAATGGACTGGATCGTGGTAAATTCAGG
AGCATCCTCCACAACACATTTGGAATGACAGATGACATGATTATGGACAGAGTCTTCCGT
GCATTTGACAAGGACAATGATAGCAACGTCAGTGTAAAAGAATGGATAGAAGGACTTTCA
GTGTTTCTGCGAGGGACCTTGgatgaaaaaattaaatATTGTTTTGAGGTTTATGACTTA
AATGGGGATGGATATATTTCACGAGAAGAGATGTTCCACATGCTGAAAAACAGTCTCATA
AAACAACCAACAGAAGAAGATCCAGATGAAGGGATTAAGGACTTGGTAGAGATTACTCTT
AAAAAGATGgaCCACGATCACGACAGCAGACTTTCATACGCTGATTTTGAGAAAGCAGTA
AAAGAAGAAAATCTCTTGCTTGAGGCTTTTGGAGCTTGTCTTCCTGATGCAAAGagtaTT
CTTGCTTTTGAAAGACAGGCCTTCCAG------GATACCACAGAAAAT
atgctgaaaatgTCGGCGATGAACAGAAAATTAATTCAAAACCTCGCCGAGACTTTATGC
AGACAAGTCAAACATTTTAATAAAACAGAGACGGAGTGTCTGATAAGGCTGTTCAACAGT
CTGCTGGGAGAGCAGGCAGAGAGAAAGACGACTATTGGAGTGGACCGGGCCAAATTCAGA
AATATACTGCACCACACTTTCGGGATGACCGACGACATGATGACGGACAGAGTTTGTCGT
GTCATTGACAAGGACAACGATGGCTACTTAAGCGTTAAAGAGTGGGTTGAggctctgtct
gtctttctaagAGGCACACTGGATGAAAAAATGAAATaCTGTTTTGAGGTGTATGACCTG
AACGGGGATGGATACATCTCACGTGAGGAGATGTTTCAGATGCTGAAAGACAGCCTCATC
AGGCAGCCCACCGAAGAGGATCCTGATGAGGGGATTAAGGATATTGTGGAGATTGCCTTG
AAAAAAATGGATTATGACCATGATGGAAGAGTTTCTTATGCTGATTTTGAGAAGACGGTC
ATGGATGAAAACCTTTTACTAGAAGCTTTTGGAAACTGCCTTCCTGATGCAAAGAGTGTA
CTAGCATTTGAGCAACAGGCATTCCAGAAACACGAACACTGCAAAGAA
example2.axt
seq3-seq4
ATGGATCGCCATTCCAAtttaatttccatttggctgcaACTGGAACTGTGTGCCATGGCA
GTACTTCTGGCAAAAGGGGAGATAAGATGCTACTGTGATGCAGCGCATTGTGTGGCAACA
GGTTACATGTGTAAATCCGAGTTAAATGCCTGCTTCACCAGGCTTCTGGACCCACAGAAC
ACAAACTCCCCTCTCACGCATGGCTGCTTGGACCCGACTGCAAACACAGCAGATGTTTGC
CATGCTGGAAGGACAGAGAGCCGCGCTGGGGCCTCGGAGAAGCTTGAGTGCTGTCACGAC
GATATGTGCAATTACAGAGGACTCCATGATGTTGTTTCATATCCCAGGGGGGACAGCTCA
GATCATGGAACAAGATATCAGCCAGACAGTAGCAGGAATCTTCTGACCAGGGTTCAGGAT
TTAACATCCTCTAAAGAGCTGTGGTTCAGAGCAGCCGTGATCGCTGTGCCCATCGCTGGG
GGGCTCATTCTAGTGCTTCTCATCATGCTCGCCTTGCGGATGCTTCGAAGTGAAAACAAA
AGACTGCAGGACCAGAGGCAGCAGATGCTGTCCCGCTTGCACTACAACTTTCATGGA---
CACCACACGAAGAAGGGCCAGGTAGCCAAACTGGATTTGGAATGCATGGTTCCCGTAACC
GGACACGAGAACTGCTGTATGACTTGCGACAAACTGCGACAGTCTGAACTCCACAAT---
---------------GATAAATTGCTGTCTTTAGTTCACTGGGGAATTTACAGCGGTCAC
GGGAAATTGGAATttgta
ATGGATCGC---------CTGGTTTCTCTGTGGTTTCAGCTGGAACTTTGTGCGATGGCT
GTTCTTCTCACGAAAGGAGAGATCAGGTGCTACTGTGACGCACCGCACTGCGTTGCCACC
GGATACATGTGTAAATCAGAGCTCAACGCTTGCTTTACTAAGGTCCTGGACCCTCTTAAC
ACAAACTCACCTTTAACACACGGCTGCGTGGATTCGCTTTTAAACTCTGCAGACGTGTGC
TCTAGTAAAAATGTGGACATTTCAAGTGGAAGCTCCTCTCCTGTGGAGTGCTGCCATGAT
GATATGTGTAACTACAGGGGTTTGCATGAC---CTCACACACCCCAGAGGGGACTCAACA
GAC---------CGATACCACAGC---TCCAATCAGAACCTGATCACAAGGGTGCAAGAG
TTAGCGTCTGCTAAAGAGGTGTGGTTCCGGGCGGCGGTGATAGCGGTTCCCATCGCGGGT
GGGCTTATCCTGGTTCTGCTGATTATGCTGGCGTTGCGAATGCTCCGTAGCGAAAACAAG
CGTCTCCAGGCACAGCGCCAGCAGATGCTTTCTCGCCTGCATTACAGCTTTCACGGACAC
CACCATGCCAAGAAAGGCCACGTGGCTAAGTTGGACTTGGAGTGTATGGTGCCGGTAACG
GGACATGAGAACTGTTGTCTGGGCTGCGATAAGCTGCGGCAGACGGATTTGTGCACTGGA
GGAGGAAGCGGGGGTGAGCGTCTCCTATCTCTGGTACACTGGGGGATGTACACGGGGCAC
GGAAAGCTGGAGTTCGTA
...
To calculate 4DTV in batch, merge the axt files into a single file (the AXT file) with a shell script. Note: in the merged AXT file each record takes three lines (the pair name and the two sequences), and a sequence must not contain any line break.
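: > Merged.AXT
for file in *.axt; do
    # Non-empty lines: 1 name line + two sequence blocks with the same number of lines
    total=$(sed '/^$/d' "$file" | wc -l)
    [ "$total" -ge 3 ] || continue
    # Ln = last line of the first sequence block
    Ln=$(( total / 2 + 1 ))
    # Mark both block boundaries with %, join all lines, then turn % back into newlines (GNU sed)
    sed '/^$/d' "$file" | sed "$Ln a \%" | sed '1 a \%' | tr -d '\n' | tr '%' '\n' >> Merged.AXT
    # tr removed the final newline; restore it so the next pair name starts on its own line
    echo >> Merged.AXT
done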
Merged.AXT
seq1-seq2
ATGTCTCATATGTCTTCTGTGAACGCGAAAAATCTTCAAAAACTAGCAGATTCAATTGTCAAACATGTAAAGCACTTTAACAATAATGAAGTTTTGTGTCTGATCAAACTCTTCAATGTGCTGATGGGAGAGCAGAGCGAGCACAGGGTTGGAAATGGACTGGATCGTGGTAAATTCAGGAGCATCCTCCACAACACATTTGGAATGACAGATGACATGATTATGGACAGAGTCTTCCGTGCATTTGACAAGGACAATGATAGCAACGTCAGTGTAAAAGAATGGATAGAAGGACTTTCAGTGTTTCTGCGAGGGACCTTGgatgaaaaaattaaatATTGTTTTGAGGTTTATGACTTAAATGGGGATGGATATATTTCACGAGAAGAGATGTTCCACATGCTGAAAAACAGTCTCATAAAACAACCAACAGAAGAAGATCCAGATGAAGGGATTAAGGACTTGGTAGAGATTACTCTTAAAAAGATGgaCCACGATCACGACAGCAGACTTTCATACGCTGATTTTGAGAAAGCAGTAAAAGAAGAAAATCTCTTGCTTGAGGCTTTTGGAGCTTGTCTTCCTGATGCAAAGagtaTTCTTGCTTTTGAAAGACAGGCCTTCCAG------GATACCACAGAAAAT
atgctgaaaatgTCGGCGATGAACAGAAAATTAATTCAAAACCTCGCCGAGACTTTATGCAGACAAGTCAAACATTTTAATAAAACAGAGACGGAGTGTCTGATAAGGCTGTTCAACAGTCTGCTGGGAGAGCAGGCAGAGAGAAAGACGACTATTGGAGTGGACCGGGCCAAATTCAGAAATATACTGCACCACACTTTCGGGATGACCGACGACATGATGACGGACAGAGTTTGTCGTGTCATTGACAAGGACAACGATGGCTACTTAAGCGTTAAAGAGTGGGTTGAggctctgtctgtctttctaagAGGCACACTGGATGAAAAAATGAAATaCTGTTTTGAGGTGTATGACCTGAACGGGGATGGATACATCTCACGTGAGGAGATGTTTCAGATGCTGAAAGACAGCCTCATCAGGCAGCCCACCGAAGAGGATCCTGATGAGGGGATTAAGGATATTGTGGAGATTGCCTTGAAAAAAATGGATTATGACCATGATGGAAGAGTTTCTTATGCTGATTTTGAGAAGACGGTCATGGATGAAAACCTTTTACTAGAAGCTTTTGGAAACTGCCTTCCTGATGCAAAGAGTGTACTAGCATTTGAGCAACAGGCATTCCAGAAACACGAACACTGCAAAGAA
seq3-seq4
ATGGATCGCCATTCCAAtttaatttccatttggctgcaACTGGAACTGTGTGCCATGGCAGTACTTCTGGCAAAAGGGGAGATAAGATGCTACTGTGATGCAGCGCATTGTGTGGCAACAGGTTACATGTGTAAATCCGAGTTAAATGCCTGCTTCACCAGGCTTCTGGACCCACAGAACACAAACTCCCCTCTCACGCATGGCTGCTTGGACCCGACTGCAAACACAGCAGATGTTTGCCATGCTGGAAGGACAGAGAGCCGCGCTGGGGCCTCGGAGAAGCTTGAGTGCTGTCACGACGATATGTGCAATTACAGAGGACTCCATGATGTTGTTTCATATCCCAGGGGGGACAGCTCAGATCATGGAACAAGATATCAGCCAGACAGTAGCAGGAATCTTCTGACCAGGGTTCAGGATTTAACATCCTCTAAAGAGCTGTGGTTCAGAGCAGCCGTGATCGCTGTGCCCATCGCTGGGGGGCTCATTCTAGTGCTTCTCATCATGCTCGCCTTGCGGATGCTTCGAAGTGAAAACAAAAGACTGCAGGACCAGAGGCAGCAGATGCTGTCCCGCTTGCACTACAACTTTCATGGA---CACCACACGAAGAAGGGCCAGGTAGCCAAACTGGATTTGGAATGCATGGTTCCCGTAACCGGACACGAGAACTGCTGTATGACTTGCGACAAACTGCGACAGTCTGAACTCCACAAT------------------GATAAATTGCTGTCTTTAGTTCACTGGGGAATTTACAGCGGTCACGGGAAATTGGAATttgta
ATGGATCGC---------CTGGTTTCTCTGTGGTTTCAGCTGGAACTTTGTGCGATGGCTGTTCTTCTCACGAAAGGAGAGATCAGGTGCTACTGTGACGCACCGCACTGCGTTGCCACCGGATACATGTGTAAATCAGAGCTCAACGCTTGCTTTACTAAGGTCCTGGACCCTCTTAACACAAACTCACCTTTAACACACGGCTGCGTGGATTCGCTTTTAAACTCTGCAGACGTGTGCTCTAGTAAAAATGTGGACATTTCAAGTGGAAGCTCCTCTCCTGTGGAGTGCTGCCATGATGATATGTGTAACTACAGGGGTTTGCATGAC---CTCACACACCCCAGAGGGGACTCAACAGAC---------CGATACCACAGC---TCCAATCAGAACCTGATCACAAGGGTGCAAGAGTTAGCGTCTGCTAAAGAGGTGTGGTTCCGGGCGGCGGTGATAGCGGTTCCCATCGCGGGTGGGCTTATCCTGGTTCTGCTGATTATGCTGGCGTTGCGAATGCTCCGTAGCGAAAACAAGCGTCTCCAGGCACAGCGCCAGCAGATGCTTTCTCGCCTGCATTACAGCTTTCACGGACACCACCATGCCAAGAAAGGCCACGTGGCTAAGTTGGACTTGGAGTGTATGGTGCCGGTAACGGGACATGAGAACTGTTGTCTGGGCTGCGATAAGCTGCGGCAGACGGATTTGTGCACTGGAGGAGGAAGCGGGGGTGAGCGTCTCCTATCTCTGGTACACTGGGGGATGTACACGGGGCACGGAAAGCTGGAGTTCGTA
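Before running the calculation it may be worth checking that the merged file really has the layout shown above: three lines per pair, two sequences of equal length, and no stray characters such as spaces inside a sequence. The function below is a minimal sketch of such a check. The name check_axt is hypothetical and it is not part of batch_4DTV_calculation.pl. It assumes that sequences contain only letters and "-" gap characters.

```shell
# check_axt: hypothetical helper, not shipped with batch_4DTV_calculation.pl.
# Assumes each record is exactly 3 lines: pair name, sequence 1, sequence 2.
check_axt() {
    awk '
    NR % 3 == 1 { name = $0; next }     # pair name line
    NR % 3 == 2 { first = $0; next }    # first aligned sequence
    {                                   # second aligned sequence
        if (length(first) != length($0)) {
            printf "%s: sequence lengths differ (%d vs %d)\n", name, length(first), length($0)
            bad = 1
        }
        # Anything other than letters and gap characters (e.g. a space) is suspicious
        if ((first $0) ~ /[^A-Za-z-]/) {
            printf "%s: unexpected character in a sequence\n", name
            bad = 1
        }
    }
    END {
        if (NR % 3 != 0) { print "line count is not a multiple of 3"; bad = 1 }
        exit bad                        # exit status 0 only if every record passed
    }' "$1"
}

# Usage (commented out so that the block can be sourced safely):
# check_axt Merged.AXT && echo "Merged.AXT looks well-formed"
```

A non-zero exit status means at least one record needs attention; the messages name the affected pair.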
batch_4DTV_calculation.pl Merged.AXT > Merged.4DTV
The 4DTV results are written to Merged.4DTV.
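For readers unfamiliar with the metric, the sketch below illustrates what an uncorrected 4DTV value means for each pair in a merged AXT file. It is not the code of batch_4DTV_calculation.pl, and its output format is not that of Merged.4DTV. The function name raw_4dtv is hypothetical, and the sketch rests on several assumptions: the alignment is in codon frame starting at the first column, the standard genetic code applies, and a codon position counts as a 4-fold degenerate site only when both sequences share the same first two bases from one of the eight 4-fold degenerate codon families. The name of the original script suggests that a correction is applied on top of the raw rate; that step is deliberately left out here.

```shell
# raw_4dtv: hypothetical illustration, not part of batch_4DTV_calculation.pl.
# Prints: pair name, number of 4-fold degenerate sites, transversions, raw rate.
raw_4dtv() {
    awk '
    BEGIN {
        # First two bases of the eight 4-fold degenerate codon families (standard code)
        n = split("CT GT TC CC AC GC CG GG", fam, " ")
        for (i = 1; i <= n; i++) fourfold[fam[i]] = 1
        purine["A"] = purine["G"] = 1
    }
    NR % 3 == 1 { name = $0; next }           # pair name line
    NR % 3 == 2 { a = toupper($0); next }     # first aligned sequence
    {
        b = toupper($0); sites = 0; tv = 0    # second aligned sequence
        for (i = 1; i + 2 <= length(a); i += 3) {
            pa = substr(a, i, 2); pb = substr(b, i, 2)
            if (pa != pb || !(pa in fourfold)) continue    # not a shared 4-fold family
            x = substr(a, i + 2, 1); y = substr(b, i + 2, 1)
            if (x !~ /[ACGT]/ || y !~ /[ACGT]/) continue    # skip gaps and ambiguous bases
            sites++
            # Transversion: one third base is a purine, the other a pyrimidine
            if ((x in purine) != (y in purine)) tv++
        }
        if (sites > 0) printf "%s\t%d\t%d\t%.4f\n", name, sites, tv, tv / sites
        else           printf "%s\t0\t0\tNA\n", name
    }' "$1"
}

# Usage (commented out so that the block can be sourced safely):
# raw_4dtv Merged.AXT
```

In a pair such as CTAGGACCT / CTCGGGCCT, all three codons are 4-fold degenerate sites: A/C at the first is a transversion, A/G at the second is a transition, and the third is identical, so the raw rate is 1/3.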