westlake-repl / SaprotHub

SaprotHub: Making Protein Modeling Accessible to All Biologists
MIT License
71 stars 7 forks source link

A few improvements of code and test scripts and etc. #24

Closed YaoYinYing closed 4 months ago

YaoYinYing commented 4 months ago

Hi,

I have made some code changes to this interesting project to enhance maintainability and integratability with various protein design protocols.

In brief, the changes include:

  1. Adding a GitHub Action CI for testing.
  2. Removing uploaded TMalign and FoldSeek binaries. Instead, two utility modules now handle the binaries, which works for both macOS and Ubuntu.
  3. PretrainedModel: Fetching weights model from Hugging Face.
  4. Introducing basic classes like Mask (handling mask input string and index output) and StructuralSequence (handling amino acid sequence, structural sequence, and SA sequence).

Considering the substantial amount of code changes, it might be improper to contribute them in a single PR. I will submit the PR for your review if you find these changes valuable.

Thank you again for your excellent project!

LTEnjoy commented 4 months ago

Hello Yinying,

Thank you very much for your interest in our work and having done such a lot of work to make it better!

There are too much files being modified and added so we have to carefully evaluate the code before we merge them into the main branch. I'm sure that some of the changes would be beneficial. However, it seems not necessary to wrap a function too strictly, which might lose the flexibility.

For instance, I notice that you defined two classes for TMalign, namely TMalignCompile and TMalignWrapper. I'm not sure is it worth defining too many things rather than a simple function get_tmscore(), which only requires a binary file and two pdb paths. Basically, we want to keep the code as simple as possible for higher maintainability and flexibility.

It's worth noting that we have to ensure users to run SaprotHub smoothly even without any ML and coding background. So what we care most is that how we can improve the interactive interface for better user experience. Given this consideration, further decorations for previous code will introduce more complexity and may not be recommended.

Anyway, we do not doubt that there must be some useful changes from your submissions and we sincerely appreciate your effor to make the project better. Besides, you can add me on Wechat for further discussion. Looking forward to your futher feedback!

Best regards, Jin

YaoYinYing commented 4 months ago

Hi Jin,

Thanks for your reply and pointing these out!

I completely understand the honorable responsibility that you and your team have in reshaping our protein community and it's indeed taugh to make code change on such a scale.

One of my purpose is reducing the deployment works (third party binaries, pretrained weights, etc.) that people might be concern with, especially on their own machines. Moreover, I aim to make it easier for those who, like myself, are trying to connect SaProt to their own protein design workflow.

Now let me explain to you this - I wrote TMalignCompile because I couldn't find a proper binary build for my Mac with a M1 chip, meaning that I had to compile it from source code and handling any potential exceptions during building. It may seem out of place in the codebase, but I couldn't find a better solution. 🥲

However, as for the wrapper class, which was originally inspired from AF2's data tools like HHblits , I acknowledge that it is over-designed, adding redundant and irrelevant code, which complicates maintenance. I truly apologize for this. 🫤

Still, if you find any part of these code changes valuable for your work, feel free to cherry-pick it and adapt them into you codebase!

Best regards, Yinying