zx0223winner / HSDFinder

a tool to predict highly similar duplicates (HSDs) in eukaryotes
MIT License

KeyError: 'g11714.t1' found when running HSDFinder #2

Open zx0223winner opened 1 year ago

zx0223winner commented 1 year ago

Here is an enquiry emailed by a current user; it may help others who run into the same problem.

I'm trying to run HSDFinder but it ends with an error. Any suggestions, please? Are there any Python packages that need to be installed?

python3 /home/sunn/data/softwares/HSDFinder/HSDFinder.py -i Aven.fa_BLASTP_species.txt -p 90.0 -l 10 -f Aven.fa.tsv -t Pfam  -o Aven.species.txt  
Traceback (most recent call last):
  File "/home/sunn/data/softwares/HSDFinder/HSDFinder.py", line 104, in <module>
    main(sys.argv[1:])
  File "/home/sunn/data/softwares/HSDFinder/HSDFinder.py", line 99, in main
    result = operation.pfam_file_fun(input_file, percentage, length, pfam, p_type, output_file)
  File "/home/sunn/data/softwares/HSDFinder/operation.py", line 25, in pfam_file_fun
    output = pfam.step(lines, p_filter, s_length)
  File "/home/sunn/data/softwares/HSDFinder/pfam.py", line 41, in step
    lengtha = int(genes[items[0]])
KeyError: 'g11714.t1'
zx0223winner commented 1 year ago

Hi, this is a common error in HSDFinder, especially for species rich in gene duplicates. I have detailed the cause and the solution at the link below; let me know if you still have trouble with the new merged BLAST result file. Please open the link and search for the question below:

https://github.com/zx0223winner/HSDFinder#how-to-deal-with-error-require-length-of-gene-

How to deal with Error: require length of gene ?

~Xi

zx0223winner commented 1 year ago

Hi

This is the same error and it can be easily fixed. It looks like your new BLAST all-against-all file is still missing the length info lines; you can confirm this by searching the file for lines like the one below. Make sure you run the Unix command on your protein data to generate a length info line for every protein.

XP_034417599.1 XP_034417599.1 100 30256
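To see which IDs are affected before re-running HSDFinder, a small check along these lines may help. It is only a sketch, not part of HSDFinder: the function name `find_ids_missing_length` is made up for illustration, and it assumes that a length info line has exactly four whitespace-separated fields (ID, same ID, 100, length) as in the example line above, while ordinary BLAST hits have more columns.

```python
def find_ids_missing_length(lines):
    """Return the IDs that occur in the file but have no length info line.

    Assumption: a length info line looks like 'ID ID 100 LENGTH'
    (four fields, both IDs identical, LENGTH an integer).
    """
    with_length = set()  # IDs that have a length info line
    seen = set()         # every ID found in the first two columns
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        seen.update(fields[:2])
        if len(fields) == 4 and fields[0] == fields[1] and fields[3].isdigit():
            with_length.add(fields[0])
    return sorted(seen - with_length)


if __name__ == "__main__":
    # Tiny made-up example: g1.t1 has a length line, g11714.t1 does not.
    demo = [
        "g1.t1\tg1.t1\t100\t250",
        "g1.t1\tg11714.t1\t95.2\t240\t3\t0\t1\t240\t1\t240\t1e-150\t480",
    ]
    print(find_ids_missing_length(demo))  # ['g11714.t1']
```

On a real file, the lines could be passed in with `open("your_blast_file.txt")` (a placeholder name). Any ID the function reports is a candidate for the `KeyError`.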

awk '/^>/{if (l!="") print l; print; l=0; next}{l+=length($0)}END{print l}' '/.../.../protein.fa' |paste - - |sed 's/>//g'|awk -F'\t' '{print $1"\t"$1"\t"100"\t"$2}' >##.protein.length.aa
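For anyone who prefers Python to the awk pipeline, the same length info lines can be produced roughly as follows. This is a sketch under stated assumptions rather than an official HSDFinder script: `fasta_length_lines` is a hypothetical helper name, and like the command above it uses the whole header line (minus the leading `>`) as the ID, so headers that carry a description after the ID may need trimming to match the IDs in the BLAST file.

```python
def fasta_length_lines(fasta_lines):
    """Yield 'ID<TAB>ID<TAB>100<TAB>LENGTH' for each record in a protein FASTA."""
    header = None
    length = 0
    for raw in fasta_lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            # A new record starts: emit the previous one first.
            if header is not None:
                yield f"{header}\t{header}\t100\t{length}"
            header = line[1:]  # whole header line without '>'
            length = 0
        elif header is not None:
            # Sequence may be wrapped over several lines; sum them up.
            length += len(line)
    if header is not None:
        yield f"{header}\t{header}\t100\t{length}"


if __name__ == "__main__":
    # Made-up two-record FASTA for illustration.
    demo = [">g1.t1", "MKV", "LLA", ">g2.t1", "MA"]
    for out in fasta_length_lines(demo):
        print(out)
```

With a real file, the generator can be fed `open("protein.fa")` and its output written to a length file, which is then merged with the BLAST all-against-all result as described in the README.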

~Xi