Question: Concatenated sequences as input?

nbawe commented 3 years ago

@xavierdidelot Is it important to use concatenated sequences of core genome as mentioned in issue #39 or can the concatenated sequences also be used?

Thank you!

xavierdidelot commented 3 years ago

The best input is a fasta file containing your genome alignment (aligned for example against a reference genome). Alternatively, if you have multiple alignments of core genes or core genome regions, you can combine them into a xmfa file.

nbawe commented 3 years ago

@xavierdidelot thank you for prompt response I have ran CFML using concatenated core genome fasta but now reading #39 I am afraid that is not the best way to use CFML.

So I have few more specific questions:

I have used fast-GeP published in Bioinformatics. As I understad it produces concatenated sequece alignment of core genes example output available here.

1) Am I correct that the file produced core_genomes.fas contains concatenated fasta sequence of all the core genes and it is not recommended to use with CFML? 2) fast-GeP also produces file named clonalframe.dat which is essentially xmfa that is more suitable to use with CFML? Example of clonalframe.dat:

#gene1
>genome1.fas
atgccaatttttat...aaaaataaatacttga
>genome2.fas
atgccaatttttat...aaaaataaatacttga
>genome3.fas
atgccaatttttat...aaaaataaatacttga
=
#gene2
>genome1.fas
ttgattttaagtgg...gcttcaccaatagcaggcggaatatag
>genome2.fas
ttgattttaagtgg...gcttcaccaatagcaggcggaatatag
>genome3.fas
ttgattttaagtgg...gcttcaccaatagcaggcggaatatag
...
=
#geneX
>genome1
ttga...atatag
>genome2
ttga...atatag
>genome2
ttga...atatag

Thank you in advance!

xavierdidelot commented 3 years ago

Yes the file clonalframe.dat looks exactly like the xmfa file you need to run ClonalFrameML on the output of fast-GeP. Presumably this is the reason why this file is named like this!

nbawe commented 3 years ago

Thank you!

nbawe commented 3 years ago

@xavierdidelot One more question do you have any idea why i have different number of sites and should the xmfa file end with equal sign (=): 1) when using concatenated fasta e.g. core_genomes.fas I get the following: Read 7 sequences of length 1487811 sites from core_genomes.fas 2) when using xmfa e.g. clonalframe.dat. I get the following: Read 7 sequences of length 3180811 sites from clonalframe.dat

One more interesting find in output is that except the sites number on line Read 7 sequences of length ... sites from ... the other part of log is completely the same and reads:

core_genome.fas:

ClonalFrameML v1.12
em = true
emsim = 100
output_filtered = true
show_progress = true
Finished reading in control file.

Read 7 sequences of length 1487811 sites from core_genomes.fas
IMPUTATION AND RECONSTRUCTION OF ANCESTRAL STATES:
Analysing 182 sites
Empirical nucleotide frequencies:   A 36.3%   C 12.7%   G 18.1%   T 32.9%
Maximum log-likelihood for imputation and ancestral state reconstruction = -2423.56
Wrote imputed and reconstructed ancestral states to cfml_emsim.ML_sequence.fasta
Wrote position cross-reference file to cfml_emsim.position_cross_reference.txt
BRANCH LENGTH CORRECTION/RECOMBINATION ANALYSIS:
Analysing 1487811 sites
Empirical nucleotide frequencies:   A 36.3%   C 12.7%   G 18.1%   T 32.9%
Beginning branch optimization. Key to parameters (and constraints):
B   uncorrected branch length
L   maximum unnormalized log-posterior per branch
P   contribution of the prior to L
R   R/theta per branch                                       (> 0)
I   mean DNA import length per branch                        (> 0)
D   divergence of DNA imported by recombination              (> 0)
M   expected number of mutations per branch                  (> 0)
nmut = 21.9572048 nU = 1487808.4 nsub = 0.0427951901 nI = 2.60098272
nU>I = 0.087307644 dU = 1487807.4 nI>U = 0.0873074643 dI = 2.60095224
numTrans = 1487807.31 0.087307644 0.0873074643 1487807.31
...

clonalframe.dat:

ClonalFrameML v1.12
xmfa_file = true
em = true
emsim = 100
output_filtered = true
show_progress = true
Finished reading in control file.

Read 7 sequences of length 3180811 sites from clonalframe.dat
IMPUTATION AND RECONSTRUCTION OF ANCESTRAL STATES:
Analysing 182 sites
Empirical nucleotide frequencies:   A 36.3%   C 12.7%   G 18.1%   T 32.9%
Maximum log-likelihood for imputation and ancestral state reconstruction = -2423.56
Wrote imputed and reconstructed ancestral states to cfml_emsim.ML_sequence.fasta
Wrote position cross-reference file to cfml_emsim.position_cross_reference.txt
BRANCH LENGTH CORRECTION/RECOMBINATION ANALYSIS:
Analysing 1487811 sites
Empirical nucleotide frequencies:   A 36.3%   C 12.7%   G 18.1%   T 32.9%
Beginning branch optimization. Key to parameters (and constraints):
B   uncorrected branch length
L   maximum unnormalized log-posterior per branch
P   contribution of the prior to L
R   R/theta per branch                                       (> 0)
I   mean DNA import length per branch                        (> 0)
D   divergence of DNA imported by recombination              (> 0)
M   expected number of mutations per branch                  (> 0)
nmut = 21.8767437 nU = 1487803.86 nsub = 0.123256274 nI = 7.13648733
nU>I = 0.179446013 dU = 1486109.96 nI>U = 0.137480711 dI = 7.04027124
numTrans = 1486109.78 0.179446013 0.137480711 1486109.78
...

xavierdidelot commented 3 years ago

That's because when working with a xmfa file, ClonalFrameML inserts 1000bp blank regions between each consecutive block.

nbawe commented 3 years ago

@xavierdidelot Thank you a lot

xavierdidelot / ClonalFrameML

Question: Concatenated sequences as input? #118