westlake-repl / SaProt

Saprot: Protein Language Model with Structural Alphabet (AA+3Di)
MIT License

Downloading pretraining dataset from huggingface #43

Open rubenweitzman opened 4 months ago

rubenweitzman commented 4 months ago

Hi, thanks for providing the pre-training database with Foldseek tokens! I'm having difficulty downloading the dataset and using it with Hugging Face functions. I'm trying:

from datasets import load_dataset

# Load the dataset
dataset = load_dataset("westlake-repl/AF2_UniRef50")

# Load the train split of the dataset
train_dataset = dataset["train"]

but I get this error:

if not module_name:
    raise DataFilesNotFoundError("No (supported) data files found" + (f" in {path}" if path else ""))
    return module_name, default_builder_kwargs

DataFilesNotFoundError: No (supported) data files found in westlake-repl/AF2_UniRef50

What then is the proper way to load in the dataset from huggingface?

LTEnjoy commented 4 months ago

Hi,

AF2_UniRef50 is organized in LMDB format, so load_dataset cannot read it directly. To load it, you first have to download it and then open it using the lmdb package.

Here is an example of how to get samples:

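import json

import lmdb

lmdb_dir = "/your/path/to/AF2_UniRef50/train"
with lmdb.open(lmdb_dir, readonly=True).begin() as txn:
    # The number of samples is stored under the "length" key
    length = int(txn.get('length'.encode()).decode())
    for i in range(length):
        # LMDB keys are bytes, so convert the integer index to str before encoding
        data_str = txn.get(str(i).encode()).decode()
        data = json.loads(data_str)
        print(data)
        break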

Hope this could resolve your problem:)
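
For the download step mentioned above, one option is snapshot_download from the huggingface_hub package, which fetches the raw repository files instead of trying to parse them as a datasets table. The following is only a minimal sketch under assumptions: the local directory name and the helper names (download_af2_uniref50, index_key, iter_samples) are hypothetical, the repository is assumed to contain a train LMDB directory as in the path above, and the full download is likely to be very large.

```python
import json
from pathlib import Path


def download_af2_uniref50(local_dir="AF2_UniRef50"):
    """Fetch the raw files of the dataset repo (hypothetical helper)."""
    # Imported lazily so the other helpers work without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    path = snapshot_download(
        repo_id="westlake-repl/AF2_UniRef50",
        repo_type="dataset",   # the repo lives under "datasets" on the Hub
        local_dir=local_dir,   # assumption: any writable directory works
    )
    return Path(path)


def index_key(i):
    """LMDB keys are bytes, so an integer index is stringified, then encoded."""
    return str(i).encode()


def iter_samples(lmdb_dir, limit=None):
    """Yield decoded JSON samples from an LMDB directory laid out as above."""
    import lmdb  # lazy import, same reason as above

    env = lmdb.open(str(lmdb_dir), readonly=True, lock=False)
    try:
        with env.begin() as txn:
            # The sample count is stored under the "length" key.
            length = int(txn.get(b"length").decode())
            n = length if limit is None else min(limit, length)
            for i in range(n):
                yield json.loads(txn.get(index_key(i)).decode())
    finally:
        env.close()
```

With these helpers, iter_samples(download_af2_uniref50() / "train", limit=1) would yield the first sample, assuming the train directory exists at that location after the download.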

heya5 commented 2 weeks ago

@LTEnjoy Hi, can I download the original structure data for the sequences?

LTEnjoy commented 2 weeks ago

Hi,

I'm sorry, but the original structure data is too large to upload, so we are unable to share it. You can download all AF2 structures from the official website: https://alphafold.ebi.ac.uk/.
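
As a rough illustration of fetching one structure at a time from that website, the sketch below builds a per-entry file URL from a UniProt accession and downloads it with the standard library. The URL pattern (AF-<accession>-F1-model_v<version>.pdb under /files) and the default version number are assumptions based on how AlphaFold DB has served files and may change; the helper names are hypothetical. Bulk downloads are better served by the archives the site itself offers.

```python
import urllib.request
from pathlib import Path

# Assumption: per-entry files are served under this prefix.
AFDB_FILES = "https://alphafold.ebi.ac.uk/files"


def af2_structure_url(accession, version=4, fmt="pdb"):
    """Build the assumed download URL for one UniProt accession."""
    return f"{AFDB_FILES}/AF-{accession}-F1-model_v{version}.{fmt}"


def download_structure(accession, out_dir=".", version=4):
    """Download a single predicted structure and return the local path."""
    url = af2_structure_url(accession, version)
    out_path = Path(out_dir) / url.rsplit("/", 1)[-1]
    urllib.request.urlretrieve(url, out_path)  # raises on HTTP errors
    return out_path
```

For example, download_structure("P69905") would try to save AF-P69905-F1-model_v4.pdb in the current directory, provided the assumed URL pattern still holds.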