zjshi / Maast

Microbial agile accurate SNP Typer
MIT License

[Question] How to interpret the `gt_results` in relation to the `--rep-fna`? #21

Open jolespin opened 7 months ago

jolespin commented 7 months ago

Running version: maast=1.0.8=py310hc2b1e32_0

Running the following command:

maast end_to_end --in-dir assemblies/DENV2/ --rep-fna References/DENV2.fa  --out-dir maast_output/DENV2/ --skip-centroid --keep-redundancy

Trying to understand the output better:

Local Pos: up to seven digits, which specifies the local position of a SNP on a contig

Global Pos: up to seven digits, which specifies the global position of a SNP in a species and serves as a sort of ID

I’m assuming that Columns 2 and 3 are one and the same and identify the NT position of the SNP.

Allele 1: single character, A, C, G or T, which specifies allele 1 of a SNP

I am not sure whether this base refers to the reference sequence’s (wild-type) base, that is, the base in the DENV2.fa file provided for --rep-fna.

Allele 2: similar to the Ref allele, but specifies allele 2 of a SNP

Does this refer to the SNP of the sample that is different from Column 4 and hence identifies the SNP?

zjshi commented 6 months ago

Hi jolespin, thanks for trying maast.

> I’m assuming that Columns 2 and 3 are one and the same and identify the NT position of the SNP.

They are not the same. Column 2 is the position of the SNP on the chromosome named in Column 1; Column 3 is a unique identifier derived from the global genomic position. For example, three SNPs, A, B and C, can have the following values in their first three columns: Chrom 1 Pos 1 ID 1, Chrom 1 Pos 2 ID 2, Chrom 2 Pos 1 ID 3. As you can see, Column 3 always increases and is therefore unique, while Column 2 always indicates the position on the chromosome.
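The three-SNP example can be reproduced with a short Python sketch. It assumes the ID is a plain running counter over SNPs listed in genome order, which is enough to match the example values; `assign_snp_ids` is a hypothetical helper written for illustration and is not taken from Maast's code, which may derive the ID differently.

```python
def assign_snp_ids(snps):
    """Attach a running ID (Column 3) to SNPs given in genome order.

    snps: iterable of (chrom, local_pos) tuples, i.e. Columns 1 and 2.
    Assumption: the ID is a simple 1-based counter; Maast itself may
    compute it from the global genomic position in another way.
    """
    rows = []
    for snp_id, (chrom, local_pos) in enumerate(snps, start=1):
        rows.append((chrom, local_pos, snp_id))
    return rows


# SNPs A, B and C from the example above.
example = [("Chrom1", 1), ("Chrom1", 2), ("Chrom2", 1)]
rows = assign_snp_ids(example)

for chrom, local_pos, snp_id in rows:
    print(chrom, local_pos, snp_id, sep="\t")
```

The local position restarts at 1 on the second chromosome, while the ID keeps increasing, so only the ID (or the pair of Column 1 and Column 2) identifies a SNP uniquely.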

> I am not sure whether this base refers to the reference sequence’s (wild-type) base, that is, the base in the DENV2.fa file provided for --rep-fna.

> Does this refer to the SNP of the sample that is different from Column 4 and hence identifies the SNP?

Good questions! The two are related, so I'll answer them together. Allele 1 is actually the major allele identified in the input genomes, and Allele 2 is the minor one. Allele 1 serves as the reference allele, but that does not mean it is always the allele found on the reference genome (perhaps calling it the representative genome is less confusing). In practice, this resolves conflicts when the actual allele on the reference genome is extremely rare.
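To make the major/minor distinction concrete, here is a minimal Python sketch for a single site. It assumes "major" simply means the most frequent base among the input genomes and "minor" the second most frequent; the helper names `major_minor` and `describe_site` are hypothetical, and Maast's own calling logic (filters, thresholds, tie handling) is not reproduced here.

```python
from collections import Counter

VALID_BASES = {"A", "C", "G", "T"}


def major_minor(alleles):
    """Return (major, minor) bases at one site, ranked by frequency.

    Assumption: major = most common base, minor = second most common.
    Anything other than A/C/G/T (e.g. "N" or "-") is ignored.
    """
    counts = Counter(a for a in alleles if a in VALID_BASES)
    ranked = counts.most_common(2)
    major = ranked[0][0] if ranked else None
    minor = ranked[1][0] if len(ranked) > 1 else None
    return major, minor


def describe_site(rep_base, alleles):
    """Compare the representative genome's base with Allele 1."""
    major, minor = major_minor(alleles)
    return {
        "allele1": major,
        "allele2": minor,
        "rep_matches_allele1": rep_base == major,
    }


# Hypothetical site: the representative genome carries the rare base T,
# while four of the five input genomes carry A.
site = describe_site("T", ["A", "A", "A", "A", "T"])
print(site)
```

In this made-up site, Allele 1 is A and Allele 2 is T even though the representative genome carries T, which is the situation described above where the reference genome's allele is the rare one.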

Let me know please if you have further questions.

jolespin commented 6 months ago

To be honest, I'm still not entirely clear. Would you be able to add a detailed explanation to your main repo of how to interpret this output, once using reads as input and separately using contigs as input? Maybe a FAQ section? I loved how easy this tool was to use, but I'm not sure how to interpret the results, so I can't really use it in the paper I'm working on at the moment.

zjshi commented 6 months ago

We actually have a demonstration study that walks through interpreting the results in a bit more detail. Check it out at https://www.sciencedirect.com/science/article/pii/S2666166722008449. Please feel free to leave your suggestions for the FAQ, too. Thanks.