yatisht / usher

Ultrafast Sample Placement on Existing Trees
MIT License
120 stars 40 forks source link

Too many sequences in an output tree #358

Open imdanique opened 8 months ago

imdanique commented 8 months ago

Hi, thanks for developing Usher.

I'm encountering an issue. My goal is to find the 10 nearest neighbors for each sequence in a FASTA file and build a tree with input sequences + neighbors. Here's what I did:


   #Alignment
    mafft --thread 16  --keeplength --addfragments all_seq.fasta ref.fa > aligned_seq.fa

    # Ensuring that first seq is ref
    head -1 aligned_seq.fa

    # Downloads the problematic sites in ref genome
    wget -O - "https://raw.githubusercontent.com/W-L/ProblematicSites_SARS-CoV2/master/problematic_sites_sarsCov2.vcf" > problematic_sites_sarsCov2.vcf

    # Converts fasta to vcf with correction for the problematic sites
    faToVcf -includeNoAltN -maskSites=problematic_sites_sarsCov2.vcf aligned_seq.fa aligned_seq.vcf

    # Downloads the latest global lineages
    wget -O - "http://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/public-latest.all.masked.pb.gz" 

    # Runs usher for lineage assignment
    usher -i public-latest.all.masked.pb \
    -v aligned_seq.vcf -k 10  -T 16 -d result 

Then, I tried to calculate all IDs in output subtrees:


from Bio import Phylo
import glob
# Initialize an empty list to store all the trees
trees = []

# Loop through all the files containing the subtrees
for file in glob.glob("result/*.nh"):
    tree = Phylo.read(file, "newick")
    # Extract sequence IDs from the tree and add them to the list
    sequence_ids.extend([leaf.name for leaf in tree.get_terminals()])

# Open a text file to write the sequence IDs
with open('result/sequence_ids.txt', 'w') as f:
    for id in sequence_ids:
        f.write(id + '\n')

Calculated all unique entries:

sort usher_results_k10/sequence_ids.txt | uniq | grep -v "node" | wc -l
>  2595922

In my input I had 1909 sequences. I expected a maximum of 19,090 output sequences (10 x input sequences), but I ended up with 2,595,922 sequences in 1152 output trees. How do I modify the Usher command to provide only 10 neighbors? Any insights on what went wrong would be greatly appreciated.

AngieHinrichs commented 8 months ago

Hi @imdanique. In addition to the -k output files subtree-.nh, usher also makes an output file for the whole tree, final-tree.nh, and the `glob("result/.nh")in your Python script might be picking that up as well. Try changing that toglob("result/subtree-*.nh")` to match only the subtree output files, and let us know if that doesn't fix it.