zjshi / Maast

Microbial agile accurate SNP Typer
MIT License

SNP matrix for all genome #27

Open dugala239 opened 2 months ago

dugala239 commented 2 months ago

Hey,

We only get a VCF file for the tag genomes, plus a SNP file for each genome in gt_results. Do you have any suggestions for making a SNP matrix of all genomes? Thanks!

doy-pin commented 1 week ago

Seconding this request. I plan to use the matrix to run RAxML. I tried the concat_allele.aln.fasta produced by the maast tree command, but RAxML only recognizes half of the SNPs listed in the VCF file. Is there something I should tweak when running RAxML on concat_allele.aln.fasta?

Thank you very much!

zjshi commented 1 week ago

Hi, thanks for using Maast! I think concat_allele.aln.fasta should work with RAxML or IQ-TREE with at most minor changes. Could you share your file? I am happy to take a look and see whether there is a simple workaround.

doy-pin commented 1 week ago

Hi!

The command I used for RAxML is: raxmlHPC-PTHREADS -s ${DIRR}concat_allele.aln.fasta -f a -m GTRGAMMA -p 12345 -x 12345 -N 1000 -n Xoo -T 50

I have attached the logfile, JobName.3693.txt (the job is still running as of now).

doy-pin commented 1 week ago

I have about 51,100 SNPs based on core_snps.vcf, but RAxML only recognizes half of them.
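To make the comparison concrete, the number of records in the VCF can be counted directly. The following is a minimal sketch, assuming a plain-text (uncompressed) VCF in which header lines start with `#`; the function name `count_vcf_records` is hypothetical and not part of Maast.

```python
def count_vcf_records(lines):
    """Count data records (non-blank lines that are not '#' headers) in VCF text."""
    return sum(1 for line in lines if line.strip() and not line.startswith("#"))


# Toy VCF content for illustration only (not real Maast output).
demo = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT",
    "chr1\t10\t.\tA\tG",
    "chr1\t25\t.\tC\tT",
]
print(count_vcf_records(demo))  # prints 2

# On a real file (path is an assumption):
# with open("core_snps.vcf") as fh:
#     print(count_vcf_records(fh))
```

Note that a multi-allelic site still counts as one record here, so the figure is a count of sites, not of alternate alleles.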

zjshi commented 1 week ago

OK, I see. Could you please verify the length of each concatenated allele sequence in concat_allele.aln.fasta? Not every SNP in core_snps.vcf necessarily ends up in the MSA file. Several factors decide this: whether the site is bi-allelic, whether it is covered by a good k-mer, the prevalence of the site in the population, etc.
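One way to run this check without extra tools is a short script. This is only a sketch, assuming a standard FASTA file with `>` header lines; `fasta_lengths` is a hypothetical helper, not a Maast function.

```python
from collections import Counter


def fasta_lengths(lines):
    """Return {record_id: sequence_length} for FASTA text given as an iterable of lines."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the record ID.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


# Toy alignment for illustration only.
demo = [">g1", "ACGT", "AC", ">g2", "ACGTAC"]
lengths = fasta_lengths(demo)
# Tally how many records share each length; an alignment should show one length.
print(Counter(lengths.values()))

# On a real file (path is an assumption):
# with open("concat_allele.aln.fasta") as fh:
#     print(Counter(fasta_lengths(fh).values()))
```

If the tally shows a single length, every sequence in the alignment has the same number of sites.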

doy-pin commented 1 week ago

Based on seqkit stats, all entries in concat_allele.aln.fasta have 48,920 bases.

zjshi commented 1 week ago

What about the invariant sites (i.e. sites that have the same allele across all genomes) in the MSA? Could this be due to RAxML automatically removing these sites?
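Whether the alignment contains invariant sites can be checked directly. Below is a minimal sketch, assuming the sequences are already loaded as equal-length strings and that gaps (`-`) and `N` should be ignored when comparing bases; both the function name and that convention are assumptions, not Maast or RAxML behavior.

```python
def count_invariant_columns(seqs, ignore="-N"):
    """Count alignment columns in which all called bases are identical."""
    if not seqs:
        return 0
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must have equal length")
    skip = set(ignore)
    invariant = 0
    for column in zip(*seqs):
        called = {base.upper() for base in column} - skip
        # A column with zero or one distinct called base carries no variation.
        if len(called) <= 1:
            invariant += 1
    return invariant


# Toy alignment: only the last column varies.
demo = ["ACGT", "ACGA", "ACGT"]
print(count_invariant_columns(demo))  # prints 3
```

A result of zero on the real alignment would suggest that invariant-site removal is not what reduces the count.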

doy-pin commented 1 week ago

I assume invariant sites were already removed by the maast tree command? I left all arguments at their defaults. I will also try running the aligned FASTA in IQ-TREE to see whether the number of distinct patterns identified differs.
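The number of distinct patterns can also be estimated independently of either program. This sketch counts unique alignment columns, assuming equal-length sequences held in memory; `count_distinct_patterns` is a hypothetical name, and the count a tree program reports may differ depending on how it treats gaps and ambiguity codes.

```python
def count_distinct_patterns(seqs):
    """Count unique alignment columns (site patterns) across equal-length sequences."""
    if not seqs:
        return 0
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must have equal length")
    # Each column becomes a tuple of bases; identical columns collapse in the set.
    return len(set(zip(*(s.upper() for s in seqs))))


# Toy alignment: columns 1 and 2 are identical, so 4 sites give 3 patterns.
demo = ["AAGT", "AAGA", "CCGT"]
print(count_distinct_patterns(demo))  # prints 3
```

If many SNP sites split the genomes in the same way, the pattern count can be well below the site count even though no site was dropped.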

zjshi commented 1 week ago

Yes, you are right. Maast removes sites below the minimum MAF (Minor Allele Frequency) and the minimum MAC (Minor Allele Count). Please let me know how it goes with IQ-TREE. In the meantime, I will look into it on my end. Thanks.
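For readers unfamiliar with these two thresholds, here is an illustrative sketch of how MAC and MAF can be computed for a single alignment column. It is not Maast's implementation: the function names, the handling of gaps and `N`, and the threshold values `min_maf=0.01` and `min_mac=1` are all assumptions chosen for the example.

```python
from collections import Counter


def minor_allele_stats(column, ignore="-N"):
    """Return (MAC, MAF) for one column, taking the second most common allele as minor."""
    skip = set(ignore)
    counts = Counter(b.upper() for b in column if b.upper() not in skip)
    total = sum(counts.values())
    if total == 0 or len(counts) < 2:
        # No called bases, or only one allele observed: no minor allele.
        return 0, 0.0
    mac = counts.most_common(2)[1][1]
    return mac, mac / total


def passes_filter(column, min_maf=0.01, min_mac=1):
    """Keep a site only if it meets both (hypothetical) thresholds."""
    mac, maf = minor_allele_stats(column)
    return maf >= min_maf and mac >= min_mac


print(minor_allele_stats("AAAG"))  # prints (1, 0.25)
print(passes_filter("AAAA"))       # prints False
```

Under this definition an invariant column always has MAC 0, so any minimum MAC of 1 or more removes it.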