Closed nbawe closed 3 years ago
The best input is a fasta file containing your genome alignment (aligned for example against a reference genome). Alternatively, if you have multiple alignments of core genes or core genome regions, you can combine them into a xmfa file.
@xavierdidelot thank you for prompt response I have ran CFML using concatenated core genome fasta but now reading #39 I am afraid that is not the best way to use CFML.
So I have few more specific questions:
I have used fast-GeP published in Bioinformatics. As I understad it produces concatenated sequece alignment of core genes example output available here.
1) Am I correct that the file produced core_genomes.fas
contains concatenated fasta sequence of all the core genes and it is not recommended to use with CFML?
2) fast-GeP also produces file named clonalframe.dat
which is essentially xmfa that is more suitable to use with CFML?
Example of clonalframe.dat:
#gene1
>genome1.fas
atgccaatttttat...aaaaataaatacttga
>genome2.fas
atgccaatttttat...aaaaataaatacttga
>genome3.fas
atgccaatttttat...aaaaataaatacttga
=
#gene2
>genome1.fas
ttgattttaagtgg...gcttcaccaatagcaggcggaatatag
>genome2.fas
ttgattttaagtgg...gcttcaccaatagcaggcggaatatag
>genome3.fas
ttgattttaagtgg...gcttcaccaatagcaggcggaatatag
...
=
#geneX
>genome1
ttga...atatag
>genome2
ttga...atatag
>genome2
ttga...atatag
Thank you in advance!
Yes the file clonalframe.dat
looks exactly like the xmfa file you need to run ClonalFrameML on the output of fast-GeP. Presumably this is the reason why this file is named like this!
Thank you!
@xavierdidelot One more question do you have any idea why i have different number of sites and should the xmfa file end with equal sign (=):
1) when using concatenated fasta e.g. core_genomes.fas I get the following:
Read 7 sequences of length 1487811 sites from core_genomes.fas
2) when using xmfa e.g. clonalframe.dat. I get the following:
Read 7 sequences of length 3180811 sites from clonalframe.dat
One more interesting find in output is that except the sites number on line Read 7 sequences of length ... sites from ...
the other part of log is completely the same and reads:
core_genome.fas:
ClonalFrameML v1.12
em = true
emsim = 100
output_filtered = true
show_progress = true
Finished reading in control file.
Read 7 sequences of length 1487811 sites from core_genomes.fas
IMPUTATION AND RECONSTRUCTION OF ANCESTRAL STATES:
Analysing 182 sites
Empirical nucleotide frequencies: A 36.3% C 12.7% G 18.1% T 32.9%
Maximum log-likelihood for imputation and ancestral state reconstruction = -2423.56
Wrote imputed and reconstructed ancestral states to cfml_emsim.ML_sequence.fasta
Wrote position cross-reference file to cfml_emsim.position_cross_reference.txt
BRANCH LENGTH CORRECTION/RECOMBINATION ANALYSIS:
Analysing 1487811 sites
Empirical nucleotide frequencies: A 36.3% C 12.7% G 18.1% T 32.9%
Beginning branch optimization. Key to parameters (and constraints):
B uncorrected branch length
L maximum unnormalized log-posterior per branch
P contribution of the prior to L
R R/theta per branch (> 0)
I mean DNA import length per branch (> 0)
D divergence of DNA imported by recombination (> 0)
M expected number of mutations per branch (> 0)
nmut = 21.9572048 nU = 1487808.4 nsub = 0.0427951901 nI = 2.60098272
nU>I = 0.087307644 dU = 1487807.4 nI>U = 0.0873074643 dI = 2.60095224
numTrans = 1487807.31 0.087307644 0.0873074643 1487807.31
...
clonalframe.dat:
ClonalFrameML v1.12
xmfa_file = true
em = true
emsim = 100
output_filtered = true
show_progress = true
Finished reading in control file.
Read 7 sequences of length 3180811 sites from clonalframe.dat
IMPUTATION AND RECONSTRUCTION OF ANCESTRAL STATES:
Analysing 182 sites
Empirical nucleotide frequencies: A 36.3% C 12.7% G 18.1% T 32.9%
Maximum log-likelihood for imputation and ancestral state reconstruction = -2423.56
Wrote imputed and reconstructed ancestral states to cfml_emsim.ML_sequence.fasta
Wrote position cross-reference file to cfml_emsim.position_cross_reference.txt
BRANCH LENGTH CORRECTION/RECOMBINATION ANALYSIS:
Analysing 1487811 sites
Empirical nucleotide frequencies: A 36.3% C 12.7% G 18.1% T 32.9%
Beginning branch optimization. Key to parameters (and constraints):
B uncorrected branch length
L maximum unnormalized log-posterior per branch
P contribution of the prior to L
R R/theta per branch (> 0)
I mean DNA import length per branch (> 0)
D divergence of DNA imported by recombination (> 0)
M expected number of mutations per branch (> 0)
nmut = 21.8767437 nU = 1487803.86 nsub = 0.123256274 nI = 7.13648733
nU>I = 0.179446013 dU = 1486109.96 nI>U = 0.137480711 dI = 7.04027124
numTrans = 1486109.78 0.179446013 0.137480711 1486109.78
...
That's because when working with a xmfa file, ClonalFrameML inserts 1000bp blank regions between each consecutive block.
@xavierdidelot Thank you a lot
@xavierdidelot Is it important to use concatenated sequences of core genome as mentioned in issue #39 or can the concatenated sequences also be used?
Thank you!