westlake-repl / SaProt

[ICLR'24 spotlight] Saprot: Protein Language Model with Structural Alphabet
MIT License

Fine-tune dataset access #6

Closed lhallee closed 8 months ago

lhallee commented 8 months ago

Hello,

I see you have made the .mdb dataset files available. How would one go about extracting the fine-tuning data and using it for downstream tasks? I would like to fine-tune my own model, so the provided training script will not work for me. Best, Logan

LTEnjoy commented 8 months ago

Hi Logan,

You can try the code below (we use Thermostability as an example):

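import json

import lmdb

# Path to one split of one dataset (here: Thermostability, normal/train)
lmdb_dir = "/your/path/to/LMDB/Thermostability/normal/train"
env = lmdb.open(lmdb_dir, readonly=True)

# Read-only transaction; the context manager ends it on exit
with env.begin() as txn:
    # The "length" key stores the number of records as a UTF-8 string
    length = int(txn.get("length".encode("utf-8")).decode("utf-8"))
    for i in range(length):
        # Records are stored under the keys "0", "1", ..., each as a JSON string
        key = f"{i}".encode("utf-8")
        value = txn.get(key)
        data_dict = json.loads(value.decode("utf-8"))
        print(data_dict.keys())  # inspect the available fields
        break  # remove this to iterate over the whole split

env.close()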
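
If you want the whole split in memory for your own training loop, something like the sketch below may help. It assumes the same layout as the snippet above (a "length" key plus JSON records under the keys "0", "1", ...). The helper names `iter_records` and `load_split` are made up for illustration and are not part of the SaProt code, and `lock=False` is an optional choice for read-only access that you can drop.

```python
import json
from typing import Callable, Iterator, List, Optional


def iter_records(get: Callable[[bytes], Optional[bytes]]) -> Iterator[dict]:
    """Yield decoded records from a key-value getter such as `txn.get`.

    Assumed layout: a "length" key holding the record count, and keys
    "0" .. "length-1" each holding one UTF-8 encoded JSON string.
    """
    raw_length = get(b"length")
    if raw_length is None:
        raise KeyError("no 'length' key found; check the LMDB directory")
    for i in range(int(raw_length.decode("utf-8"))):
        value = get(str(i).encode("utf-8"))
        if value is None:
            continue  # skip a missing index instead of crashing
        yield json.loads(value.decode("utf-8"))


def load_split(lmdb_dir: str) -> List[dict]:
    """Read every record of one split into a list of dicts."""
    # Third-party package; imported here so iter_records has no dependency on it
    import lmdb

    env = lmdb.open(lmdb_dir, readonly=True, lock=False)
    try:
        with env.begin() as txn:
            return list(iter_records(txn.get))
    finally:
        env.close()
```

Calling `records = load_split("/your/path/to/LMDB/Thermostability/normal/train")` would then give a plain list you can wrap in your own dataset class. The field names are not listed here, so check `records[0].keys()` first.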

I hope this solves the problem. Best, Jin

lhallee commented 8 months ago

This works great. Thanks for the help!