Closed abulenciamiguel closed 2 years ago
Thanks for including all of your commands, that makes it very clear. The web interface runs this usher command:
usher -v vcfFile -i protobufFile -d tmpOutputDir -k subtreeSize -K 2000 -T numThreads
where vcfFile
is generated by the web interface from alignments of uploaded sequences, protobufFile
is the tree selected in the web interface, subtreeSize
is the size selected in the web interface (a number up to 5000), and numThreads
is set to an appropriate value for the web server.
The difference is in how the VCF file is generated. Did you by any chance adapt your script from usher/workflows/Snakefile? I see now it seems to be missing something important in how it's calling faToVcf. faToVcf expects the reference sequence to be the first sequence in the MSA fasta input. If only new sequences are included in the input to faToVcf, then the first new sequence will be used as reference, leading to incorrect output VCF.
Please give this a try, adding a new command before faToVcf to add the reference sequence before the aligned new sequences, and passing the new file to faToVcf:
cat postArtic/usher/NC_045512v2.fa postArtic/usher/myAlignedSequences.fa \
> postArtic/usher/myAlignedSequencesWithRef.fa
faToVcf -maskSites=postArtic/usher/problematic_sites_sarsCov2.vcf \
postArtic/usher/myAlignedSequencesWithRef.fa \
postArtic/usher/myAlignedSequences.vcf
It is also helpful to add the option -includeNoAltN
to faToVcf, to explicitly tell usher where the new sequences have Ns that could match either reference or alternate allele.
Sorry that we're not demonstrating the correct use of faToVcf, and thanks for alerting us to the problem!
Hello, @AngieHinrichs. Thank you for the quick help. I can now reproduce the results from the web version in our local workstation. For others that might be interested, I used the following parameters as suggested. Take note that -includeNoAltN
in faToVcf
is important.
# USHER Chunk
mkdir -p postArtic/usher
# Downloads the updated sars-cov-2 dataset
wget -O - "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=NC_045512.2&rettype=fasta" | \
sed '1 s/^.*$/>NC_045512v2/' > postArtic/usher/NC_045512v2.fa
# Alignment
unset MAFFT_BINARIES # Done if there is a problem in the configuration of the shell
mafft --thread 20 --auto --keeplength --addfragments \
articNcovNanopore_prepRedcap_concatenate_process/all_sequences.fasta \
postArtic/usher/NC_045512v2.fa > postArtic/usher/myAlignedSequences.fa
# Inserting reference genome on the first line
cat postArtic/usher/NC_045512v2.fa postArtic/usher/myAlignedSequences.fa \
> postArtic/usher/myAlignedSequencesWithRef.fa
# Downloads the problematic sites in ref genome
wget -O - "https://raw.githubusercontent.com/W-L/ProblematicSites_SARS-CoV2/master/problematic_sites_sarsCov2.vcf" > \
postArtic/usher/problematic_sites_sarsCov2.vcf
# Converts fasta to vcf with correction for the problematic sites
faToVcf -includeNoAltN -maskSites=postArtic/usher/problematic_sites_sarsCov2.vcf \
postArtic/usher/myAlignedSequencesWithRef.fa \
postArtic/usher/myAlignedSequences.vcf
# Downloads the latest global lineages
wget -O - "http://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/public-latest.all.masked.pb.gz" | \
gunzip -c > postArtic/usher/public-latest.all.masked.pb
# Runs usher for lineage assignment
usher -i postArtic/usher/public-latest.all.masked.pb \
-v postArtic/usher/myAlignedSequences.vcf -k 50 -K 2000 -T 30 -d postArtic/usher/result
Thank you again.
Thank you for confirming @ufuomababatunde!
BTW if you are a registered user at https://gisaid.org/ then I can share the full tree including GISAID sequences which has about twice as many sequences, and from many more countries, than the public-sequence-only tree.
Yes, we do have an account in GISAID. Will I be able to download the full tree in in a regular basis similar to what is done with the public-sequence-only tree?
OK, please contact me at angie at soe dot ucsc dot edu (obfuscated to avoid spambots).
Hello.
I ran usher using the CLI but got a different lineage assignment from the web version. Example: I got
B.1.1.529
from the CLI whileBA.5
in the web version.Can I ask what parameters were used in the web version so as to replicate its results in the CLI?
By the way, I used the following commands: