yatisht / usher

Ultrafast Sample Placement on Existing Trees
MIT License
122 stars 41 forks source link

Parameters used in web version #274

Closed abulenciamiguel closed 2 years ago

abulenciamiguel commented 2 years ago

Hello.

I ran usher using the CLI but got a different lineage assignment from the web version. Example: I got B.1.1.529 from the CLI while BA.5 in the web version.

Can I ask what parameters were used in the web version so as to replicate its results in the CLI?

By the way, I used the following commands:

    # USHER Chunk
    mkdir -p postArtic/usher

    # Downloads the updated sars-cov-2 dataset
    wget -O - "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=NC_045512.2&rettype=fasta" | \
    sed '1 s/^.*$/>NC_045512v2/' > postArtic/usher/NC_045512v2.fa

    # Alignment
    unset MAFFT_BINARIES # Done if there is a problem in the configuration of the shell

    mafft --thread 20 --auto --keeplength --addfragments \
    articNcovNanopore_prepRedcap_concatenate_process/all_sequences.fasta \
    postArtic/usher/NC_045512v2.fa > postArtic/usher/myAlignedSequences.fa

    # Downloads the problematic sites in ref genome
    wget -O - "https://raw.githubusercontent.com/W-L/ProblematicSites_SARS-CoV2/master/problematic_sites_sarsCov2.vcf" > \
   postArtic/usher/problematic_sites_sarsCov2.vcf

    # Converts fasta to vcf with correction for the problematic sites
    faToVcf -maskSites=postArtic/usher/problematic_sites_sarsCov2.vcf \
    postArtic/usher/myAlignedSequences.fa \
    postArtic/usher/myAlignedSequences.vcf

    # Downloads the latest global lineages
    wget -O - "http://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/public-latest.all.masked.pb.gz" | \
    gunzip -c > postArtic/usher/public-latest.all.masked.pb

    # Runs usher for lineage assignment
    usher -i postArtic/usher/public-latest.all.masked.pb \
    -v postArtic/usher/myAlignedSequences.vcf -u -d postArtic/usher/result
AngieHinrichs commented 2 years ago

Thanks for including all of your commands, that makes it very clear. The web interface runs this usher command:

usher -v vcfFile -i protobufFile -d tmpOutputDir -k subtreeSize -K 2000 -T numThreads

where vcfFile is generated by the web interface from alignments of uploaded sequences, protobufFile is the tree selected in the web interface, subtreeSize is the size selected in the web interface (a number up to 5000), and numThreads is set to an appropriate value for the web server.

The difference is in how the VCF file is generated. Did you by any chance adapt your script from usher/workflows/Snakefile? I see now it seems to be missing something important in how it's calling faToVcf. faToVcf expects the reference sequence to be the first sequence in the MSA fasta input. If only new sequences are included in the input to faToVcf, then the first new sequence will be used as reference, leading to incorrect output VCF.

Please give this a try, adding a new command before faToVcf to add the reference sequence before the aligned new sequences, and passing the new file to faToVcf:

    cat postArtic/usher/NC_045512v2.fa postArtic/usher/myAlignedSequences.fa \
    > postArtic/usher/myAlignedSequencesWithRef.fa

    faToVcf -maskSites=postArtic/usher/problematic_sites_sarsCov2.vcf \
    postArtic/usher/myAlignedSequencesWithRef.fa \
    postArtic/usher/myAlignedSequences.vcf

It is also helpful to add the option -includeNoAltN to faToVcf, to explicitly tell usher where the new sequences have Ns that could match either reference or alternate allele.

Sorry that we're not demonstrating the correct use of faToVcf, and thanks for alerting us to the problem!

abulenciamiguel commented 2 years ago

Hello, @AngieHinrichs. Thank you for the quick help. I can now reproduce the results from the web version in our local workstation. For others that might be interested, I used the following parameters as suggested. Take note that -includeNoAltN in faToVcf is important.

    # USHER Chunk
    mkdir -p postArtic/usher

    # Downloads the updated sars-cov-2 dataset
    wget -O - "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=NC_045512.2&rettype=fasta" | \
    sed '1 s/^.*$/>NC_045512v2/' > postArtic/usher/NC_045512v2.fa

    # Alignment
    unset MAFFT_BINARIES # Done if there is a problem in the configuration of the shell

    mafft --thread 20 --auto --keeplength --addfragments \
    articNcovNanopore_prepRedcap_concatenate_process/all_sequences.fasta \
    postArtic/usher/NC_045512v2.fa > postArtic/usher/myAlignedSequences.fa

    # Inserting reference genome on the first line
    cat postArtic/usher/NC_045512v2.fa postArtic/usher/myAlignedSequences.fa \
    > postArtic/usher/myAlignedSequencesWithRef.fa

    # Downloads the problematic sites in ref genome
    wget -O - "https://raw.githubusercontent.com/W-L/ProblematicSites_SARS-CoV2/master/problematic_sites_sarsCov2.vcf" > \
    postArtic/usher/problematic_sites_sarsCov2.vcf

    # Converts fasta to vcf with correction for the problematic sites
    faToVcf -includeNoAltN -maskSites=postArtic/usher/problematic_sites_sarsCov2.vcf \
    postArtic/usher/myAlignedSequencesWithRef.fa \
    postArtic/usher/myAlignedSequences.vcf

    # Downloads the latest global lineages
    wget -O - "http://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/public-latest.all.masked.pb.gz" | \
    gunzip -c > postArtic/usher/public-latest.all.masked.pb

    # Runs usher for lineage assignment
    usher -i postArtic/usher/public-latest.all.masked.pb \
    -v postArtic/usher/myAlignedSequences.vcf -k 50 -K 2000 -T 30 -d postArtic/usher/result

Thank you again.

AngieHinrichs commented 2 years ago

Thank you for confirming @ufuomababatunde!

AngieHinrichs commented 2 years ago

BTW if you are a registered user at https://gisaid.org/ then I can share the full tree including GISAID sequences which has about twice as many sequences, and from many more countries, than the public-sequence-only tree.

abulenciamiguel commented 2 years ago

Yes, we do have an account in GISAID. Will I be able to download the full tree in in a regular basis similar to what is done with the public-sequence-only tree?

AngieHinrichs commented 2 years ago

OK, please contact me at angie at soe dot ucsc dot edu (obfuscated to avoid spambots).