About difficult proteins

yourh / DeepGraphGO

DeepGraphGO: graph neural network for large-scale, multispecies protein function prediction

About difficult proteins #20

Open zhanght28 opened 7 months ago

zhanght28 commented 7 months ago

The definition of the difficult proteins is: the sequence identity of the protein (in the training set) most similar (homologous) to a difficult protein is less than 60%. Hello, was the difficult-protein data obtained with `max(hsp.identities / rec.query_length for hsp in alignment.hsps) < 0.6`? The numbers of difficult proteins I get this way for CC, MF, and BP differ from those given in the paper by about 10.
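
For readers following along, the thresholding step in that definition can be sketched as below. This is a minimal illustration, not the repository's code: the function name `select_difficult`, the protein IDs, and the identity values are made up, and treating a test protein with no BLAST hit as identity 0.0 (and therefore difficult) is an assumption.

```python
def select_difficult(test_ids, max_identity, cutoff=0.6):
    """Return test proteins whose best identity to the training set is below cutoff.

    max_identity maps a test protein ID to its highest identity (0..1) against
    any training protein. Proteins missing from the map are assumed to have no
    hit at all and are treated as identity 0.0 (an assumption of this sketch).
    """
    return [pid for pid in test_ids if max_identity.get(pid, 0.0) < cutoff]


# Hypothetical values, for illustration only.
max_identity = {"P1": 0.75, "P2": 0.41, "P3": 0.60}
test_ids = ["P1", "P2", "P3", "P4"]

difficult = select_difficult(test_ids, max_identity)
print(difficult)
```

Note that the comparison is strict, so a protein sitting exactly at 0.6 is not counted as difficult here; whether the paper used `<` or `<=` is not stated in this thread.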

yourh commented 7 months ago

I'm not quite sure why. The BLAST output already contains an identity between 0 and 1, and the cutoff is 0.6. I ran BLAST with 1 iteration.

zhanght28 commented 7 months ago

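Thank you for your reply. I based my lookup on the PSI-BLAST query results for the test set, xx-test-ppi-blast-out.xml. Perhaps I need to run PSI-BLAST against the training set instead?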

yourh commented 7 months ago

Oh, yes, it has to be run against the training set.

zhanght28 commented 7 months ago

Does the identity also need to be computed further from the BLAST results?

yourh commented 7 months ago

Yes. The BLAST output contains the identity directly, and then you take the maximum over all HSPs.
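
Taking the maximum over all HSPs can be sketched with the standard library alone, assuming the BLAST run was saved as XML (`-outfmt 5`), as the file name mentioned earlier in the thread suggests. The sample XML, protein IDs, and function name are invented for illustration. The per-HSP identity is computed here as `Hsp_identity / Hsp_align-len`, which matches how BLAST reports percent identity, but the thread does not confirm which denominator was used for the paper.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written stand-in for a BLAST XML result; all values are invented.
SAMPLE_XML = """<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_query-def>P1</Iteration_query-def>
      <Iteration_hits>
        <Hit>
          <Hit_hsps>
            <Hsp><Hsp_identity>90</Hsp_identity><Hsp_align-len>120</Hsp_align-len></Hsp>
            <Hsp><Hsp_identity>30</Hsp_identity><Hsp_align-len>100</Hsp_align-len></Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
    <Iteration>
      <Iteration_query-def>P2</Iteration_query-def>
      <Iteration_hits></Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>"""


def max_identity_per_query(xml_text):
    """Map each query to its highest HSP identity (0..1) over all hits."""
    root = ET.fromstring(xml_text)
    result = {}
    for iteration in root.iter("Iteration"):
        query = iteration.findtext("Iteration_query-def")
        best = 0.0  # queries without any hit keep 0.0 (assumption)
        for hsp in iteration.iter("Hsp"):
            identical = int(hsp.findtext("Hsp_identity"))
            align_len = int(hsp.findtext("Hsp_align-len"))
            best = max(best, identical / align_len)
        # Keep the best value if a query appears in more than one iteration.
        result[query] = max(best, result.get(query, 0.0))
    return result


best_identity = max_identity_per_query(SAMPLE_XML)
print(best_identity)
```

For a real file, the same function could be called on the file's text content; the resulting values can then be compared against the 0.6 cutoff.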

zhanght28 commented 7 months ago

I computed it as `max(hsp.identities / rec.query_length for hsp in alignment.hsps)`. The result is off, so I am wondering whether the calculation method is the problem.
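
One possible source of such a discrepancy, offered only as a hypothesis since the thread does not resolve it, is the denominator. BLAST's own percent identity divides the number of identical positions by the alignment length, while the expression above divides by the full query length. The sketch below uses invented numbers to show that the two conventions can put the same protein on opposite sides of the 0.6 cutoff. In Biopython's `NCBIXML` records the alignment length should be available as `hsp.align_length`.

```python
CUTOFF = 0.6

# Invented example: one query of length 200 with two HSPs,
# each given as (identical positions, alignment length).
query_length = 200
hsps = [(90, 120), (30, 100)]

# Convention used in the question: identical positions / query length.
by_query_length = max(ident / query_length for ident, _ in hsps)

# Convention behind BLAST's percent identity: identical positions / alignment length.
by_align_length = max(ident / align_len for ident, align_len in hsps)

print(by_query_length, by_query_length < CUTOFF)
print(by_align_length, by_align_length < CUTOFF)
```

With these numbers the protein counts as difficult under the first convention and not under the second, so the choice of denominator alone can shift the counts.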