yourh / DeepGraphGO

DeepGraphGO: graph neural network for large-scale, multispecies protein function prediction

About difficult proteins #20

Open zhanght28 opened 6 months ago

zhanght28 commented 6 months ago

The definition of difficult proteins is: the sequence identity of the protein (in the training set) most similar (homologous) to a difficult protein is less than 60%. Hello, is the difficult-protein data obtained with max(hsp.identities / rec.query_length for hsp in alignment.hsps) < 0.6? The numbers of difficult proteins I get this way for CC, MF, and BP differ from those given in the paper by about 10.

yourh commented 6 months ago

I'm not quite sure why. The BLAST output has an identity between 0 and 1, and the cutoff is 0.6. I used 1 BLAST iteration.

zhanght28 commented 6 months ago

Thanks for your reply. I based my lookup on the test set's psiblast query results, xx-test-ppi-blast-out.xml. Perhaps I need to run psiblast to get the results for the training set?

yourh commented 6 months ago

Oh, yes, you need to run it against the training set.

zhanght28 commented 6 months ago

Does the identity also need to be computed further from the BLAST results?

yourh commented 6 months ago

Yes. The BLAST output contains the identity directly, and then you take the maximum over all HSPs.

zhanght28 commented 6 months ago

I computed it as: max(hsp.identities / rec.query_length for hsp in alignment.hsps). The result of this calculation is off, so I'm wondering whether the calculation method is the problem.
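One possible source of the discrepancy, offered as a guess rather than as the authors' confirmed procedure: the percent identity that BLAST itself reports is the number of identical positions divided by the alignment length of the HSP, not by the query length. The sketch below compares the two denominators on toy data. The `Hsp` tuple, the `max_identity` and `difficult_proteins` helpers, and the example values are all hypothetical; with Biopython, the corresponding fields would be `hsp.identities`, `hsp.align_length`, and `rec.query_length`. Treating a protein with no hit in the training set as identity 0 (and therefore difficult) is also an assumption.

```python
from typing import Dict, List, NamedTuple, Optional, Set


class Hsp(NamedTuple):
    identities: int    # identical positions in this HSP
    align_length: int  # alignment length of this HSP, gaps included


def max_identity(hsps: List[Hsp], query_length: Optional[int] = None) -> float:
    """Largest identity fraction over all HSPs (0.0 when there are no hits).

    If query_length is given, identities are divided by the query length
    (the formula from the thread); otherwise by each HSP's alignment length
    (assumed here to match BLAST's own reported identity).
    """
    fractions = [
        h.identities / (query_length if query_length is not None else h.align_length)
        for h in hsps
    ]
    return max(fractions, default=0.0)


def difficult_proteins(
    hits: Dict[str, List[Hsp]],
    query_lengths: Optional[Dict[str, int]] = None,
    cutoff: float = 0.6,
) -> Set[str]:
    """Query IDs whose best identity against the training set is below cutoff.

    hits maps each test protein to its HSPs against training-set proteins only.
    """
    difficult = set()
    for query_id, hsps in hits.items():
        qlen = query_lengths[query_id] if query_lengths is not None else None
        if max_identity(hsps, qlen) < cutoff:
            difficult.add(query_id)
    return difficult


# Toy data: q1 has a short but highly identical local alignment.
hits = {
    "q1": [Hsp(identities=50, align_length=60)],
    "q2": [Hsp(identities=90, align_length=100)],
    "q3": [],  # no hit in the training set
}
query_lengths = {"q1": 100, "q2": 100, "q3": 80}

by_alignment = difficult_proteins(hits)               # identities / align_length
by_query = difficult_proteins(hits, query_lengths)    # identities / query_length

print(sorted(by_alignment))
print(sorted(by_query))
```

In this toy case q1 is counted as difficult only when dividing by the query length, so proteins with short local alignments are the ones most likely to switch sides of the 0.6 cutoff between the two definitions.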