Closed husensofteng closed 5 years ago
You need to use-max_target_seqs 1
in BLASTP to output only top hit.
Yes, you are right that is the expected behaviour, however as I show below, with the commands used for running blast in IPAW we do get more than one output per sequence query.
cat ref.fa
prot1 YAEYQVCLAAVGLVGDLCRALQSNILPFCDEVMQLLLENLGNENVHRSVKPQILSVFGDI ENVHPDVMLVQPRVEFILSFIDHIAGDEDHTDGVVACAAGLIGDLCTAFGKDVLKLVEAR
cat query.fa
seq1 AAGLIGDLC
run
blastp -db ref.fa -query query.fa -outfmt '6 qseqid sseqid pident qlen slen qstart qend sstart send mismatch positive gapopen gaps qseq sseq evalue bitscore' -num_threads 16 -evalue 1000 -out output.tsv -max_target_seqs 1
cat output.tsv
AAGLIGDLC prot1 100.00 9 120 1 9 98 106 0 9 0 0 AAGLIGDLC AAGLIGDLC 9e-05 21.2 AAGLIGDLC prot1 77.78 9 120 1 9 10 18 2 8 0 0 AAGLIGDLC AVGLVGDLC 4e-04 19.2
Hmm, i didn't consider this could happen. I will update the code storing the top hit only. Thanks for pointing it out!
Updated now. 8e5411a1d8153f6695582cdf5518c4f3fa82a54d
great thanks.
In cases where multiple hits are found for a single query fasa, only the category of the latests hit is reported and the previous ones are overwritten. This may result in marking a peptide as novel solely based on the last hit while a previous hit could have matched a known protein. https://github.com/yafeng/proteogenomics_python/blob/226ee35e30a5d82ab634402c99130d35454df000/parse_BLASTP_out.py#L38 to https://github.com/yafeng/proteogenomics_python/blob/226ee35e30a5d82ab634402c99130d35454df000/parse_BLASTP_out.py#L83