wikipathways / academy

Organized training examples for new WikiPathways authors
Creative Commons Zero v1.0 Universal

New stage describing how to model protein families in pathways #31

Open khanspers opened 1 year ago

khanspers commented 1 year ago

This comes from a curation report issue: how best to model protein families in pathways. Either use a Pfam ID on a single node representing the protein family, or list out all members of the protein family (if feasible).

danidi commented 1 year ago

The Pfam database is retired and now included in InterPro. Will InterPro identifiers work as well?

Chris-Evelo commented 1 year ago

Having InterPro identifiers work would be great anyway. But what I read on the InterPro website is that they will host Pfam, which sounds different from using InterPro identifiers instead. Does anybody know how that really works?

danidi commented 1 year ago

It seems like Pfam is still actively providing content, but it will be available only via the InterPro website: https://xfam.wordpress.com/2022/08/04/pfam-website-decommission/ In the InterPro search, you can then see a list of results coming from the different sources.

khanspers commented 1 year ago

For this particular case, the best match I could find is this InterPro identifier, for Ribosomal protein S6 kinase: https://www.ebi.ac.uk/interpro/entry/InterPro/IPR016238/
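As a rough illustration of what annotating a single family node with that identifier could look like, here is a minimal Python sketch that builds a GPML-style `DataNode` element. The element and attribute names follow my understanding of the GPML 2013a layout, the data source name `InterPro` and the `GraphId` value `fam01` are assumptions, and the `Graphics` child that a real pathway file needs is left out for brevity.

```python
import xml.etree.ElementTree as ET


def family_datanode(label, interpro_id, graph_id="fam01"):
    """Build a GPML-style DataNode annotated with an InterPro entry.

    Assumption: attribute names mirror GPML 2013a; graph_id is a
    hypothetical placeholder, and Graphics is intentionally omitted.
    """
    node = ET.Element(
        "DataNode",
        {"TextLabel": label, "GraphId": graph_id, "Type": "Protein"},
    )
    # The Xref carries the family-level identifier instead of being left blank.
    ET.SubElement(node, "Xref", {"Database": "InterPro", "ID": interpro_id})
    return node


node = family_datanode("Ribosomal protein S6 kinase", "IPR016238")
print(ET.tostring(node, encoding="unicode"))
```

The point of the sketch is only that the family is one node with one family-level Xref, rather than one node per member protein.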

I added a stage here, using the pathway from the curation report as the example: https://academy.wikipathways.org/stages/draw-protein-families/ (not yet integrated in the path). Please review.

One thing to add is a comment about data mapping (i.e. it won't work for these nodes).

danidi commented 1 year ago

Looks good! Only the upload doesn't work yet; is that intended? I got the following error: "Oops! That doesn't look quite right. Please try again. Incorrect number of objects: 5 detected, 0 expected."

Are there plans to include the data mapping at some point? It would be great if the family could be connected to the actual proteins somehow.

khanspers commented 1 year ago

Thanks @danidi! There was a typo in the GPML validation; it is fixed now.

For the data mapping, there is no plan to make that work as far as I know. These instructions were only meant to solve the issue raised in the curation report, basically as the alternative to leaving the xref empty. I can add a comment to the task that data mapping from individual proteins that are part of the family won't work. Maybe I could also describe the alternate approach of adding individual proteins as a stack of nodes off to the side of the pathway (like we do with other groupings of genes/proteins)?
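To make the data mapping limitation concrete, here is a simplified Python sketch comparing the two approaches. It is not how PathVisio or Cytoscape actually map data (real tools go through identifier mapping); all identifiers other than IPR016238, the `ExampleGeneDB` source, and the `map_data` helper are hypothetical placeholders.

```python
# Hypothetical measurements keyed by gene-level identifiers (placeholders).
measurements = {"GENE_A": 1.8, "GENE_B": -0.4}

# Approach 1: a single node that carries only a family-level identifier.
family_node = {"label": "Protein family", "xref": ("InterPro", "IPR016238")}

# Approach 2: one node per family member, stacked beside the pathway.
member_nodes = [
    {"label": "Member A", "xref": ("ExampleGeneDB", "GENE_A")},
    {"label": "Member B", "xref": ("ExampleGeneDB", "GENE_B")},
]


def map_data(nodes, data):
    """Return {label: value} for nodes whose xref ID appears in the data."""
    return {
        n["label"]: data[n["xref"][1]]
        for n in nodes
        if n["xref"][1] in data
    }


family_result = map_data([family_node], measurements)
member_result = map_data(member_nodes, measurements)
print(family_result)  # nothing matches the family-level ID
print(member_result)  # each member node picks up its own value
```

Under these assumptions the family node receives no values because the dataset has no family-level rows, while the stacked member nodes each receive their measurement.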

khanspers commented 1 year ago

On second thought, I'm not sure this should be a stage in the Academy. Although the idea to use an InterPro ID instead of leaving the xref blank is still valid for individual cases (for example the original question by Javi), it's potentially counter-intuitive and confusing as a stage in the Academy since it doesn't enable data mapping at all (at least in PathVisio, or in a straightforward way in Cytoscape). We can keep this issue open for discussion, but I'm not going to add the stage to the path for now.

Chris-Evelo commented 1 year ago

I think that is fine for now. But it is one of the ideas that often come up in discussions about functionally evaluating sequencing data from multi-species mixtures, e.g. microbiome samples. If we can assign motifs in sequences to functional protein motifs, and through those to pathways, we could in principle evaluate the functionality or the functional capacity of such a mixture without assigning the sequences to species or complete genes. Of course, we do not have complete methods for that yet.