yafeng / proteogenomics_python

python scripts for proteogenomics analysis
10 stars 10 forks source link

parse blast output script keeps the category only from the last hit #5

Closed husensofteng closed 5 years ago

husensofteng commented 5 years ago

In cases where multiple hits are found for a single query fasa, only the category of the latests hit is reported and the previous ones are overwritten. This may result in marking a peptide as novel solely based on the last hit while a previous hit could have matched a known protein. https://github.com/yafeng/proteogenomics_python/blob/226ee35e30a5d82ab634402c99130d35454df000/parse_BLASTP_out.py#L38 to https://github.com/yafeng/proteogenomics_python/blob/226ee35e30a5d82ab634402c99130d35454df000/parse_BLASTP_out.py#L83

yafeng commented 5 years ago

You need to use-max_target_seqs 1 in BLASTP to output only top hit.

husensofteng commented 5 years ago

Yes, you are right that is the expected behaviour, however as I show below, with the commands used for running blast in IPAW we do get more than one output per sequence query.

cat ref.fa

prot1 YAEYQVCLAAVGLVGDLCRALQSNILPFCDEVMQLLLENLGNENVHRSVKPQILSVFGDI ENVHPDVMLVQPRVEFILSFIDHIAGDEDHTDGVVACAAGLIGDLCTAFGKDVLKLVEAR

cat query.fa

seq1 AAGLIGDLC

run blastp -db ref.fa -query query.fa -outfmt '6 qseqid sseqid pident qlen slen qstart qend sstart send mismatch positive gapopen gaps qseq sseq evalue bitscore' -num_threads 16 -evalue 1000 -out output.tsv -max_target_seqs 1

cat output.tsv

AAGLIGDLC prot1 100.00 9 120 1 9 98 106 0 9 0 0 AAGLIGDLC AAGLIGDLC 9e-05 21.2 AAGLIGDLC prot1 77.78 9 120 1 9 10 18 2 8 0 0 AAGLIGDLC AVGLVGDLC 4e-04 19.2

yafeng commented 5 years ago

Hmm, i didn't consider this could happen. I will update the code storing the top hit only. Thanks for pointing it out!

yafeng commented 5 years ago

Updated now. 8e5411a1d8153f6695582cdf5518c4f3fa82a54d

husensofteng commented 5 years ago

great thanks.