zjshi / Maast

Microbial agile accurate SNP Typer
MIT License
24 stars 2 forks source link

Will there be an option to truly specify the reference genome #22

Closed drhoads closed 6 months ago

drhoads commented 6 months ago

Finding the results quite different between ParSNP and Maast (as reported in your manuscript). Just curious about whether there will be a future option to specify the reference genome for the analyses. Thought I could do this from the github instructions but seems that is not the case, the reference genome is chosen during the first analysis maast genome... and then we have to continue using that genome. In the end_to_end analysis the software chooses one genome to serve as the reference and I really can't figure out how the software chose the genome it does (not the biggest and certainly not the most complete in my dataset).

drhoads commented 6 months ago

As a followup when I use the following command: maast genomes --fna-dir /mnt/c/DNAwork/ChickenDiseaseSSR/ --rep-fna /mnt/c/DNAwork/ChickenDiseaseSSR/SA1.fna --out-dir /mnt/c/DNAwork/ChickenDiseaseSSR/maastwork The reference genome still comes up as a different genome...... (not SA1.fna). This makes it far more difficult to compare the results from ParSNP and Maast. With ParSNP I have SNPs that for the Alt SNPs in several annotated genes are coming up 100% and for Maast the SNPs in that gene are coming up 50-60%.

zjshi commented 6 months ago

Hi drhoads, thanks for using maast. Yes, maast allows to specify any reference genome of choice with --rep-fna and --skip-centroid. I'll appreciate it if you could please give a quick try and let me know whether it solves the issue here. For your question about the proportion of SNPs recovered, I would lower the value of --min-prev and --snp-freq for more sensitive calling. In brief, --min-prev specifies the minimum proportion of genomes a genomic site has to be present to be consider for SNP calling, and --snp-freq specifies the minimum minor allele frequency for a valid SNP. The default value of --min-prev is 1, which is very stringent and zero-tolerant of deletion. Perhaps consider lowering it to 0.9 (which is also used in the paper) for a more comparable analysis together with ParSNP. Technically, SNP calling by maast can be made extremely sensitive when low values of --min-prev and --snp-freq are used. Please feel free to follow up if that is not the case for you.

drhoads commented 6 months ago

My understanding is that --rep-fna is only an option on maast genomes because when I tried it with maast end_to_end I got an error that -rep-fna was not an option. When I used this command: maast end_to_end --in-dir /mnt/f/DNAwork/Ececorum/maastwork/ChickenDiseaseSSR --out-dir /mnt/f/DNAwork/Ececorum/maastwork/MaastCDSSR The run completes and uses the reference it chooses. I can't add --rep-fna (and --skip-centroid) to that command?

I think you may have misunderstood my comment about: "With ParSNP I have SNPs that for the Alt SNPs in several annotated genes are coming up 100% and for Maast the SNPs in that gene are coming up 50-60%."

When I ran ParSNP and Maast on the same set of genomes ParSNP returned far fewer SNPs and the most abundant SNPs were freq of 1.0 and 0.94. For Maast the most abundant SNPS have a freq of only 0.54. These freq come from counting the GT 1 from ParSNP and dividing by the number of genotypes in a phenotypic trait. I do the same for Maast output for Genotype 0:1:0:0

zjshi commented 6 months ago

My understanding is that --rep-fna is only an option on maast genomes because when I tried it with maast end_to_end I got an error that -rep-fna was not an option.

Hmm... That is not expected. Did you receive the similar error when you try the test dataset? Here is what I got: image

When I ran ParSNP and Maast on the same set of genomes ParSNP returned far fewer SNPs and the most abundant SNPs were freq of 1.0 and 0.94.

Oh I see, thanks for clarifying. The first part of your results, "When I ran ParSNP and Maast on the same set of genomes ParSNP returned far fewer SNPs ", is in line with our previous observation with ParSNP, and it actually motivated us to develop maast at the first place...My understanding is that ParSNP will do a screening over all input genomes, and then use a algorithm to choose similar genomes for (i.e. exclude divergent ones from) SNP calling. I am not 100% sure whether this is related to the freq discrepancy you have observed but might worth a closer look. It appears to me that your calculation is correct, but just in case, I would like to clarify how the maast summarizes alleles for any given SNP in a format of GP1:GP2:GP3:GP4. For any sample/genome, the format of GP1:GP2:GP3:GP4 can have five possible values: 0:0:0:0 (allele is missing), 1:0:0:0 (allele 1 or column 4), 0:1:0:0 (allele 2 or the first allele in column 5) , 0:0:1:0 (allele 3 or the second allele in column 5 if available) and , 0:0:0:1 (allele 4 or the third allele in column 5 if available). Let me know please if you have further questions.

drhoads commented 6 months ago

Everthing completed when I ran with --ref-fna and --ski-centroid and did it in the 3 step process of running maast genomes, maast db, and maast genotype. Tomorrow I will try to recreate trying to use maast end_to_end with --ref-fna to see what the error message truly said. Right now I need to go prepare dinner for the spouse. :)

Just so you know the results from maast were much more elucidating for my questions than the results from ParSNP and I am trying to finish up a manuscript for Microorganisms that they want the submission next week. I am working in poultry and we have an outbreak of sepsis from Enterococcus cecorum. This bacterium has typically been associated with osteomyelitis but this new disease trait is a much bigger issue. I have been working on discerning whether it was gene acquisition, mutations, or something else (like the banning of antibiotics!!). The maast data gave me some SNPs that gave me proteins to look for coding variants and I got more than 20 from maast and only 7 from ParSNP (with two shared). Only the two shared from ParSNP held up from the protein variation work but I got 5 additional big finds from maast. Further, the maast SNP frequencies come closer to matching the protein var frequencies.

zjshi commented 6 months ago

This might not be related to the error you saw, but maast uses --rep-fna to specify reference genome instead of --ref-fna. Anyway, everything sounds really good. I am glad that maast worked and is helpful with your study. I also appreciate it that you kindly shared more background of your work, which is very meaningful!! I look forward to reading it in the future. I will close this issue for now, but please feel free to reopen it anytime if you have further questions and run into more problems.

drhoads commented 6 months ago

FYI- I was able to get the maast end_to_end command to run with the --rep-fna --skip-centroid using the specified reference genome. If I use the same command but leave out the --skip-centroid the program chooses a different reference.
I may be misunderstanding the parameters for --snp-freq. I tried a run setting --min-prev at 0.9 and --snp-freq at 0.1 (minimum MAF of 10%, and the final core_snps.vcf only had genotypes for 10 of the 135 genomes, including no genotypes for the reference SA1.

command='end_to_end --in-dir /mnt/f/DNAwork/Ececorum/maastwork/ChickenDiseaseSSR --rep-fna /mnt/f/DNAwork/Ececorum/maastwork/ChickenDiseaseSSR/SA1.fna --skip-centroid --min-prev 0.9 --snp-freq 0.1 --out-dir /mnt/f/DNAwork/Ececorum/maastwork/maastCDSSR_SA1minprevSNPfreq'

without setting --min-prev and --snp-freq I got genotypes for all 135 genomes for 108285 SNPs. When I set those two parameters I get genotypes for 10 of the 135 genomes for 116329 SNPs.

zjshi commented 6 months ago

Sounds good, again. --snp-freq specifies the minor allele frequency that allows maast to determine how many genomes it will sample for effective SNP calling. This feature is useful when there are a massive number of input genomes. In your use case, I would use --keep-redundancy to decouple --snp-freq from the genome down sampling process and force maast to use all 136 genomes for SNP calling. Let me know please if I misunderstood the issue raised here.

drhoads commented 6 months ago

Just a followup on my progress. I really appreciate you providing additional insights into switches and settings for using Maast. The manuscript using that data will go into Microorganisms this week. FYI- I tried running the dataset using several different parameters and then filtering for the most informative SNPs for the phenotype I am pursuing. The dataset contains 138 genomes of pathogens of which 22 are obtained from sepsis, the rest are from osteomyelitis. The question from the industry has been whether these sepsis isolates represent a new genotype. I had eliminated gene acquisition but have been pursuing core genome mutations that might be critical for the switch in pathogen niche. For 1-3 I specified the reference genome and for 4 I let maast choose. switches Total SNPs GT1 SNPs >50% GT1 diff>=0.2 # GT1=0:1:0:0 1) -min-prev 0.9 --keep redundancy 136434 283 97 2) --min-prev 0.9 -snp-freq default 136434 283 97 3) defaults for --min-prev and --snp-freq 108285 252 77 4) maast chooses ref using defaults 108200 248 47

For the publication I went with the data from #3, but it does look like a few more SNPs might be mined with options 1 or 2 which appear to be equivalent.

I now have 3 other genera to turn my attention to. Again, thanks for all your help and for producing a great set of software.

From: Zhou (Jason) Shi @.> Sent: Sunday, December 31, 2023 2:48 AM To: zjshi/Maast @.> Cc: Douglas Duane Rhoads @.>; Author @.> Subject: Re: [zjshi/Maast] Will there be an option to truly specify the reference genome (Issue #22)

Sounds good, again. --snp-freq specifies the minor allele frequency that allows maast to determine how many genomes it will sample for effective SNP calling. This feature is useful when there are a massive number of input genomes. In your use case, I would use --keep-redundancy to decouple --snp-freq from the genome down sampling process and force maast to use all 136 genomes for SNP calling. Let me know please if I misunderstood the issue raised here.

- Reply to this email directly, view it on GitHubhttps://github.com/zjshi/Maast/issues/22#issuecomment-1872868521, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AIX22VTHL7FGSUFAEPTZ7UDYMEREZAVCNFSM6AAAAABBBYE7I6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQNZSHA3DQNJSGE. You are receiving this because you authored the thread.Message ID: @.**@.>>