xavierdidelot / ClonalFrameML

ClonalFrameML: Efficient Inference of Recombination in Whole Bacterial Genomes
GNU General Public License v3.0

-ignore_incomplete_sites flag not 'ignoring' ambiguous sites #73

Closed mmelendrez closed 6 years ago

mmelendrez commented 6 years ago

Hi, I'm running ClonalFrameML on Linux Mint 18.2 Sonya.

It downloaded and compiled just fine. My alignment has many ambiguous sites, and as I was unsure whether the algorithm could handle them, I first ran it without the -ignore_incomplete_sites flag:

mel@lucita ~/Analysis/ClonalFrameML $ ~/Software/ClonalFrameML/src/ClonalFrameML SynA_out_trim_1334_names_MLrooted.nwk SynA_out_trim_1334_names.fasta SynA_out_trim_1334_names_CFML
ClonalFrameML v1.11-3-g4f13f23
Finished reading in control file.

Read 15 sequences of length 1256468 sites from SynA_out_trim_1334_names.fasta
ERROR: FASTA_to_nucleotide(): unsupported base K in sequence 7 (60AY4M2_PEA14) position 397

That suggested the algorithm doesn't work with alignments that contain ambiguous sites, so I then specified -ignore_incomplete_sites true:

mel@lucita ~/Analysis/ClonalFrameML $ ~/Software/ClonalFrameML/src/ClonalFrameML SynA_out_trim_1334_names_MLrooted.nwk SynA_out_trim_1334_names.fasta SynA_out_trim_1334_names_CFML -ignore_incomplete_sites true > test.log.txt
mel@lucita ~/Analysis/ClonalFrameML $ cat test.log.txt 
ClonalFrameML v1.11-3-g4f13f23
ignore_incomplete_sites = true
Finished reading in control file.

Read 15 sequences of length 1256468 sites from SynA_out_trim_1334_names.fasta
ERROR: FASTA_to_nucleotide(): unsupported base K in sequence 7 (60AY4M2_PEA14) position 397

Same error. Suggestions? Does the algorithm only consider 'N' an ambiguous base or does it support the IUPAC ambiguity codes?

Thanks!
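Before rerunning, it can help to see which characters in the alignment would trigger this error. The following is a minimal sketch, not part of ClonalFrameML: it tallies, per sequence, every character outside a supported set. The set used here (A, C, G, T, N and the gap character) is an assumption and should be adjusted to whatever the program actually accepts; `read_fasta` and `count_unsupported` are hypothetical helper names.

```python
from collections import Counter

# Assumption: only these characters are treated as supported; adjust as needed.
SUPPORTED = set("ACGTN-")


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)


def count_unsupported(lines):
    """Return {sequence name: Counter of characters outside SUPPORTED}."""
    report = {}
    for name, seq in read_fasta(lines):
        odd = Counter(c for c in seq if c not in SUPPORTED)
        if odd:
            report[name] = odd
    return report
```

An open file handle can be passed directly as `lines`, for example `count_unsupported(open("SynA_out_trim_1334_names.fasta"))`.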

tseemann commented 6 years ago

Given that you got the error "ERROR: FASTA_to_nucleotide(): unsupported base K in sequence 7", it seems very likely that CFML does not support those IUPAC codes.

The manual https://github.com/xavierdidelot/clonalframeml/wiki also says "The positions listed in the file specified by the -ignore_usersites option lists any site that was not callable or present in all the genomes.", so it seems that for any non-core sites you need to provide a file of all those coordinates?

xavierdidelot commented 6 years ago

Hi Mel,

Yes indeed, ClonalFrameML does not recognise the IUPAC codes; you will need to replace them with gaps. This needs to be done even if the site is excluded using the -ignore_user_sites or -ignore_incomplete_sites option, because the file is read before these options are applied.

Best wishes, Xavier
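The replacement suggested above could be scripted along these lines. This is a minimal sketch rather than anything shipped with ClonalFrameML: it assumes that only A, C, G, T, N and the gap character should be kept, and it upper-cases sequence lines before masking (also an assumption). `mask_fasta` is a hypothetical helper name.

```python
import re

# Assumption: every character other than A, C, G, T, N or '-' becomes a gap.
UNSUPPORTED = re.compile(r"[^ACGTN-]")


def mask_fasta(lines):
    """Yield FASTA lines with unsupported characters replaced by '-'."""
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # Header lines are passed through unchanged.
            yield line
        else:
            # Sequence lines are upper-cased, then masked.
            yield UNSUPPORTED.sub("-", line.upper())
```

To write a masked copy, iterate over an open input file and write each yielded line plus a newline to an output file; the alignment length and coordinates are unchanged because characters are replaced one for one.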

mmelendrez commented 6 years ago

Ok fair enough - perhaps a bit of clarification in the manual on this:

-ignore_incomplete_sites       true or false (default)   Ignore sites with any ambiguous bases.

would be great then. I erroneously assumed this meant that if I had ambiguous bases as defined by IUPAC, those sites would be ignored automatically. I chose this over the

-ignore_user_sites

option because I didn't have a list of sites to ignore (i.e. sites with ambiguities), as I thought the previous flag would 'auto-ignore' those sites.

I will go back and omit those sites from my alignment and retry. I hesitate to turn them all into gaps, as wouldn't that impact genomic reconstruction? Unless gap sites are ignored by the algorithm.

Thanks so much for both speedy replies.

V/R - Mel
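For the approach Mel describes (finding the sites with ambiguities and omitting them), a minimal sketch might look like the following. It assumes the sequences are already loaded as equal-length strings, reports 1-based positions (an assumption), and treats anything other than A, C, G or T as ambiguous by default. `ambiguous_columns` and `drop_columns` are hypothetical helper names, and the file format expected by -ignore_user_sites is not described in this thread, so writing that file is left out.

```python
def ambiguous_columns(seqs, allowed="ACGT"):
    """Return 1-based positions where any sequence has a character outside `allowed`."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    ok = set(allowed)
    return [i + 1 for i in range(length)
            if any(s[i].upper() not in ok for s in seqs)]


def drop_columns(seqs, positions):
    """Return copies of the sequences with the given 1-based columns removed."""
    skip = {p - 1 for p in positions}
    return ["".join(c for i, c in enumerate(s) if i not in skip) for s in seqs]
```

Note that dropping columns shifts the coordinates of every downstream site, so any positions reported against the trimmed alignment no longer map directly onto the original genome.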