wshuai294 / PStrain

Pstrain profiles strains in metagenomics data. It infers strain abundance and genotype for each species. Also, it has a single species mode; where given a BAM and VCF, it can phase the variants for any species.
MIT License
8 stars 2 forks source link

metaphlan3 support? #1

Closed nick-youngblut closed 3 years ago

nick-youngblut commented 3 years ago

Are there any plans to support metaphlan3 databases?

wshuai294 commented 3 years ago

Thank you for your suggestion. I will make PStrain support metaphlan3 in the future. I will collect comments and update a new version as soon as possible.

ShriramHPatel commented 3 years ago

Thank you for making this tool open source. Just checking in any update to support metaphlan3 database yet?

Additionaly, my query is related to prior database that is provided with this tool. I assume (as per the supplement section 1.2) it is specifically generated from the marker sequences from metaphlan2 database right?

I have reformatted metaphlan3 database (for species of my interest) to make it compatible with PStrain, but can I use default prior database with the output? If not can you provide some guidance/ instruction on how to prepare it (I am only focusing on very few bacteria species so could be easy to prepare on my end).

Thank you

wshuai294 commented 3 years ago

Hi,

Thank you for your concern.

Up to now, I have not updated PStrain to support metaphlan3 yet. Sorry for the delay.

For the prior database, it is indeed generated for the marker genes from metaphlan2 database. However, you can use it directly. It will assume the prior beta is 0.5 if not finding information from the prior database. The performance is good enough without the prior database except for the gene with too low depth.

Also, you can generate the prior database by yourself. You can follow these steps:

  1. Run PStrain directly to get the bam file in each sample, and call SNP with GATK in gvcf formate which will indicate the depth for each base.
  2. Merge the SNP by computing mean frequency of each base in all samples. And generate a file named "prior_beta.txt" in this formate:
    locus   A       T       C       G
    gi|378760316|ref|NZ_AGDG01000038.1|:8694-9629|Bacteroides_faecis_517    1       0       0       0
    gi|410936685|ref|NZ_AJRF02000012.1|:54814-55320|Enterococcus_faecium_299        0       1       0       0
    gi|139439993|ref|NZ_AAVN02000016.1|:7311-8633|Collinsella_aerofaciens_783       0.0102827763496144      0.0205655526992288      0       0.969151670951157
    gi|239633764|ref|NZ_GG670286.1|:165060-166319|Enterococcus_gallinarum_626       0       0       0       1
    gi|345651640|ref|NZ_JH114383.1|:276898-279936|Bacteroides_vulgatus_2725 0       0       0       1
    GeneID:16385027|Lactococcus_phage_jm2_95        1       0       0       0
    gi|488673303|ref|NZ_KB850950.1|:c879018-878179|Clostridium_hathewayi_511        0       0       0       1
  3. Then you can convert it to pickle formate by
    import pickle
    popu={}
    f=open('prior_beta.txt','r')
    for line in f:
    break
    for line in f:
    line=line.strip()
    array=line.split()
    atcg=array[1:]
    popu[array[0]]=atcg  
    with open('prior_beta.pickle', 'wb') as f:
     pickle.dump(popu, f, pickle.HIGHEST_PROTOCOL) 

    Please let me know if it helps.

Best, Shuai

ShriramHPatel commented 3 years ago

Thank you for your prompt reply and useful instruction. That was helpful. Best,

wshuai294 commented 3 years ago

Are there any plans to support metaphlan3 databases?

Sorry for the delay, now PStrain supports metaphlan3. You can use PStrain_V30.py to do this.