Using Maast to construct a custom sck-mer database

mitchawitt1 commented 2 weeks ago

I wanted to use Maast's sck-mer database construction command db on a set of VCF, MSA Fasta, and reference.fna (Klebsiella pneumoniae subsp. pneumoniae reference genome KPNIH1) files which are not generated by Maast but by another pipeline. The purpose of using this database is similar to Maast in genotyping reads of population sequences, so using the gt command would also be an ideal outcome of making Maast generated sck-mers. I have made scripts that try to match my files to the specific file formats that Maast generates in the core snp calling step, but haven't modified the source code at all.

However, I only get sck-mers called for 5 out of 3,775 SNPs in my .vcf, resulting in only 155 kmers in my database. An end_to_end run on the same dataset (without redundancy reduction and with reference specified) gives me about 55k core SNPs found, resulting in over a million sck-mers, which seems appropriate according to my previous end_to_end runs, so I know it wouldn't be the assemblies that I'm working with.

I was wondering if there any other format-specific details I should know about, or whether maast db would be the appropriate approach for this. Is my use of reference .fna appropriate in this case?

zjshi commented 1 week ago

Hi @mitchawitt1, thanks for using Maast. I think maast db could be useful in your case. Except for VCF, MSA Fasta, and reference.fna, the module also requires --tag-fna-list which points to a file containing paths to the genomes included in the VCF and MSA file, and --fna-dir which points to a dir containing all genomes of your dataset, a.k.a, tag genomes + non-tag genomes. The reason is that Maast not only extracts SNP-covering kmers based on VCF and MSA files, it also perform a filtering step to select the ones frequently present in the population as well as remove those are ambigous. Another step I would done is to make sure the VCF and MSA file have the same format as those you may found from an example run. The genomes in the VCF and MSA fasta should also perfectly match. I hope this reply helps. Please feel free to follow up.

mitchawitt1 commented 1 week ago

ahh sorry @zjshi I didn't specify that as well, but thank you for your response. I actually have included a --tag-fna-list and an --fna-dir in my maast db call as well. My tag fna list just includes all of the assemblies in my --fna-dir, as my alignment includes all of the genomes I want to examine. In this case, I guess you can think of it as having all tag genomes, I don't know if this would cause a flaw in the maast logic.

I also noticed my multiple sequence alignment file has multiple newlines in the sequences, whereas the maast generated MSAs sequences are one line each, so I have adjusted that and started a run which will take an hour or two.

Here is my localized maast call for reference. Here you can see I haven't specified --snp-cover or --kmer-type as I thought the default for these two arguments are appropriate.

REF="KPNIH1.fasta"
THREADS=1
VCF="ldvs_maast.vcf"
MSA="gubbins_masked_msa.fa"
TAG_LIST="tag_fna_list"
FNA_DIR="../test_cases/gubbins/assemblies/"
OUT_DIR="LDV_DB/"

mkdir -p $OUT_DIR

maast db --ref-genome $REF --threads $THREADS --vcf $VCF --msa $MSA --out-dir $OUT_DIR --tag-fna-list $TAG_LIST --fna-dir $FNA_DIR --out-dir $OUT_DIR

Are the only filtering criterion specified by --snp-cover and --kmer-type?

zjshi / Maast

Using Maast to construct a custom sck-mer database #30