Open mitchawitt1 opened 2 weeks ago
Hi @mitchawitt1, thanks for using Maast. I think maast db
could be useful in your case. Except for VCF, MSA Fasta, and reference.fna, the module also requires --tag-fna-list
which points to a file containing paths to the genomes included in the VCF and MSA file, and --fna-dir
which points to a dir containing all genomes of your dataset, a.k.a, tag genomes + non-tag genomes. The reason is that Maast not only extracts SNP-covering kmers based on VCF and MSA files, it also perform a filtering step to select the ones frequently present in the population as well as remove those are ambigous. Another step I would done is to make sure the VCF and MSA file have the same format as those you may found from an example run. The genomes in the VCF and MSA fasta should also perfectly match. I hope this reply helps. Please feel free to follow up.
ahh sorry @zjshi I didn't specify that as well, but thank you for your response. I actually have included a --tag-fna-list
and an --fna-dir
in my maast db
call as well. My tag fna list just includes all of the assemblies in my --fna-dir
, as my alignment includes all of the genomes I want to examine. In this case, I guess you can think of it as having all tag genomes, I don't know if this would cause a flaw in the maast
logic.
I also noticed my multiple sequence alignment file has multiple newlines in the sequences, whereas the maast
generated MSAs sequences are one line each, so I have adjusted that and started a run which will take an hour or two.
Here is my localized maast call for reference. Here you can see I haven't specified --snp-cover
or --kmer-type
as I thought the default for these two arguments are appropriate.
REF="KPNIH1.fasta"
THREADS=1
VCF="ldvs_maast.vcf"
MSA="gubbins_masked_msa.fa"
TAG_LIST="tag_fna_list"
FNA_DIR="../test_cases/gubbins/assemblies/"
OUT_DIR="LDV_DB/"
mkdir -p $OUT_DIR
maast db --ref-genome $REF --threads $THREADS --vcf $VCF --msa $MSA --out-dir $OUT_DIR --tag-fna-list $TAG_LIST --fna-dir $FNA_DIR --out-dir $OUT_DIR
Are the only filtering criterion specified by --snp-cover
and --kmer-type
?
I wanted to use
Maast
's sck-mer database construction commanddb
on a set of VCF, MSA Fasta, and reference.fna (Klebsiella pneumoniae subsp. pneumoniae reference genome KPNIH1) files which are not generated byMaast
but by another pipeline. The purpose of using this database is similar toMaast
in genotyping reads of population sequences, so using thegt
command would also be an ideal outcome of makingMaast
generated sck-mers. I have made scripts that try to match my files to the specific file formats thatMaast
generates in the core snp calling step, but haven't modified the source code at all.However, I only get sck-mers called for 5 out of 3,775 SNPs in my .vcf, resulting in only 155 kmers in my database. An
end_to_end
run on the same dataset (without redundancy reduction and with reference specified) gives me about 55k core SNPs found, resulting in over a million sck-mers, which seems appropriate according to my previousend_to_end
runs, so I know it wouldn't be the assemblies that I'm working with.I was wondering if there any other format-specific details I should know about, or whether
maast db
would be the appropriate approach for this. Is my use of reference .fna appropriate in this case?