wukevin / proteinclip

Contrastive learning harmonizing protein language models and natural language models
https://www.biorxiv.org/content/10.1101/2024.05.14.594226v1
MIT License
24 stars, 5 forks

How to project text? #3

Open xnought opened 1 month ago

xnought commented 1 month ago

From the paper, I see that you first embed text with text-embedding-3-large and then apply the projection network trained with contrastive learning.

Could you also release the pretrained text projection models?

I want to embed text in the joint embedding space and find similar proteins that way.
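A rough sketch of that workflow, to make the request concrete. Everything numeric here is a placeholder: the random arrays stand in for a real text-embedding-3-large vector, for the (not yet released) text projection weights, and for precomputed ProteinCLIP protein embeddings. The single linear layer in `project_text` and the joint dimension of 128 are assumptions for illustration, not the architecture from the paper or the repo.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def project_text(text_emb: np.ndarray, weight: np.ndarray, bias: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for the trained text projection head.

    A single linear layer is assumed here; the real head may differ.
    """
    return l2_normalize(text_emb @ weight + bias)


def top_k_proteins(query: np.ndarray, protein_embs: np.ndarray, k: int = 5):
    """Return indices and cosine scores of the k proteins closest to the query."""
    scores = l2_normalize(protein_embs) @ query
    order = np.argsort(-scores)[:k]
    return order, scores[order]


rng = np.random.default_rng(0)
text_dim = 3072    # default output size of text-embedding-3-large
joint_dim = 128    # assumed size of the joint embedding space
n_proteins = 1000  # toy database size

text_emb = rng.normal(size=text_dim)                      # placeholder text embedding
weight = rng.normal(size=(text_dim, joint_dim))           # placeholder projection weights
bias = np.zeros(joint_dim)                                # placeholder projection bias
protein_embs = rng.normal(size=(n_proteins, joint_dim))   # placeholder protein embeddings

query = project_text(text_emb, weight, bias)
idx, scores = top_k_proteins(query, protein_embs, k=5)
print(idx, scores)
```

With real weights, only the three placeholder arrays would change; the normalize-then-dot-product search stays the same.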

Any help would be appreciated! Thank you very much. This is a super cool project!

xnought commented 1 month ago

I made an interface too

https://github.com/user-attachments/assets/2609b63f-66fe-4643-8220-d5fc0459e487

If you can help me with the text part I mentioned above, I can add natural language queries to the website.

xnought commented 1 month ago

Update: I deployed it at https://ocular.cc.gatech.edu/DS569k/ in case anyone wants to use it.

xnought commented 3 weeks ago

I also made a Nomic 2D map of 250k proteins using ProteinCLIP embeddings, topic-modeled by function, in case you're interested: https://atlas.nomic.ai/data/donnybertucci/swissprot-proteinclip/map

young-su-ko commented 3 weeks ago

> Also made a Nomic 2d map of 250k proteins using protein clip + topic modeled based on function if you were interested https://atlas.nomic.ai/data/donnybertucci/swissprot-proteinclip/map

Which version of proteinclip did you use for this?

xnought commented 3 weeks ago

The smallest one: the 6-layer ESM2 version.

If you want the data, you can also download it here: https://huggingface.co/datasets/donnyb/DS569k. I precomputed all the embeddings and other metadata so I can reuse them later.

young-su-ko commented 3 weeks ago

Thanks! It's really fun to just look around the different regions of the 2D map.