yourh / DeepGraphGO

DeepGraphGO: graph neural network for large-scale, multispecies protein function prediction

Generating protein features with InterProScan #8

Alexzhuan closed this issue 2 years ago

Alexzhuan commented 2 years ago

Hello, I have a question about generating protein features with InterProScan: which applications did you run? InterProScan lists the following available analyses:

TIGRFAM (XX.X) : TIGRFAMs are protein families based on Hidden Markov Models or HMMs
SFLD (X.X) : SFLDs are protein families based on Hidden Markov Models or HMMs
Hamap (XXXXXX.XX) : High-quality Automated and Manual Annotation of Microbial Proteomes
SMART (X.X) : SMART allows the identification and analysis of domain architectures based on Hidden Markov Models or HMMs
CDD (X.XX) : Prediction of CDD domains in Proteins
ProSiteProfiles (XX.XXX) : PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them
ProSitePatterns (XX.XXX) : PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them
SUPERFAMILY (X.XX) : SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes
PRINTS (XX.X) : A fingerprint is a group of conserved motifs used to characterise a protein family
PANTHER (X.X) : The PANTHER (Protein ANalysis THrough Evolutionary Relationships) Classification System is a unique resource that classifies genes by their functions, using published scientific experimental evidence and evolutionary relationships to predict function even in the absence of direct experimental evidence
Gene3D (X.X.X) : Structural assignment for whole genes and genomes using the CATH domain structure database
PIRSF (X.XX) : The PIRSF concept is being used as a guiding principle to provide comprehensive and non-overlapping clustering of UniProtKB sequences into a hierarchical order to reflect their evolutionary relationships
Pfam (XX.X) : A large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs)
Coils (X.X) : Prediction of Coiled Coil Regions in Proteins
MobiDBLite (X.X) : Prediction of disordered domains Regions in Proteins

yourh commented 2 years ago

We used all of them except PANTHER.
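For readers who want to reproduce that selection, here is a minimal sketch of how the command line might be assembled. It is not taken from the DeepGraphGO repository. It assumes the standard `interproscan.sh` options `-i`, `-f`, `-o` and `-appl`, and it assumes the application names match the listing above, which can differ between InterProScan releases. The file names and the helper `build_command` are hypothetical.

```python
import shlex

# Application names as they appear in the "Available analyses" listing above.
# Assumption: your InterProScan release uses the same names; check its own
# listing before running.
ALL_APPLICATIONS = [
    "TIGRFAM", "SFLD", "Hamap", "SMART", "CDD",
    "ProSiteProfiles", "ProSitePatterns", "SUPERFAMILY", "PRINTS",
    "PANTHER", "Gene3D", "PIRSF", "Pfam", "Coils", "MobiDBLite",
]


def build_command(fasta_path, output_path, exclude=("PANTHER",)):
    """Return an interproscan.sh argument list that skips the excluded analyses."""
    selected = [name for name in ALL_APPLICATIONS if name not in exclude]
    return [
        "interproscan.sh",
        "-i", fasta_path,            # input FASTA file
        "-f", "tsv",                 # tab-separated output
        "-o", output_path,           # output file
        "-appl", ",".join(selected)  # comma-separated list of analyses
    ]


if __name__ == "__main__":
    # Hypothetical file names, printed rather than executed.
    print(shlex.join(build_command("proteins.fasta", "proteins.tsv")))
```

The sketch only prints the command, so it can be inspected before anything is run.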

Alexzhuan commented 2 years ago

One more question: how did you convert the InterProScan output into binary features? Could you share the relevant code? Thanks!

yourh commented 2 years ago

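The InterProScan output is simply the domains, families and motifs associated with each protein; you just convert those into one-hot features. That code was too simple to be worth tidying up, so we didn't release it.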
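Since the conversion code was not released, the following is only a sketch of what the reply describes, not the authors' implementation. It assumes InterProScan's TSV output, where column 1 is the protein accession, column 5 the member-database signature accession and column 12 the InterPro accession. Which of the two accessions DeepGraphGO used as the feature is not stated in this thread, so the column is left as a parameter. The helper names and the sample rows are made up for illustration.

```python
from collections import defaultdict

PROTEIN_COL = 0     # protein accession
SIGNATURE_COL = 4   # member-database signature accession, e.g. PF00069
INTERPRO_COL = 11   # InterPro accession, e.g. IPR000719 ("-" when absent)


def read_features(lines, feature_col=SIGNATURE_COL):
    """Map each protein ID to the set of feature accessions reported for it."""
    hits = defaultdict(set)
    for line in lines:
        row = line.rstrip("\n").split("\t")
        if len(row) <= feature_col:
            continue  # optional trailing columns may be missing
        feature = row[feature_col]
        if feature and feature != "-":
            hits[row[PROTEIN_COL]].add(feature)
    return hits


def to_binary_matrix(hits, proteins=None):
    """Build one 0/1 row per protein over the sorted feature vocabulary."""
    proteins = list(proteins) if proteins is not None else sorted(hits)
    vocab = sorted({f for feats in hits.values() for f in feats})
    index = {f: j for j, f in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in proteins]
    for i, protein in enumerate(proteins):
        for feature in hits.get(protein, ()):
            matrix[i][index[feature]] = 1
    return proteins, vocab, matrix


# Made-up rows in the InterProScan TSV layout, for illustration only.
sample_tsv = [
    "P1\tmd5a\t300\tPfam\tPF00069\tProtein kinase domain\t10\t250\t1e-50\tT\t01-01-2020\tIPR000719\tProtein kinase domain",
    "P1\tmd5a\t300\tSMART\tSM00220\tserkin_6\t10\t250\t1e-40\tT\t01-01-2020\tIPR000719\tProtein kinase domain",
    "P2\tmd5b\t120\tPfam\tPF00069\tProtein kinase domain\t5\t110\t1e-20\tT\t01-01-2020\tIPR000719\tProtein kinase domain",
    "P2\tmd5b\t120\tCoils\tCoil\tCoil\t5\t30\t-\tT\t01-01-2020\t-\t-",
]

proteins, vocab, matrix = to_binary_matrix(read_features(sample_tsv))
```

Strictly speaking each row is a multi-hot vector, because a protein can carry several domains at once. For real proteome-scale data a sparse matrix would be a more practical container than nested lists. When the model is applied to new proteins, the vocabulary should be fixed from the training set so that column positions stay consistent.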